(57) Abstract: The present
invention relates to methods for 
diagnosing the metastatic potential 
of hepatocellular carcinoma (HCC) 
in HCC patients and methods 
for diagnosing the potential of 
developing HCC in patients with 
chronic liver diseases. A computer 
readable medium, a digital computer, 
and a system useful for such 
diagnosis are also provided. Further 
disclosed are methods for identifying 
potential therapeutic targets for 
treating metastasis in HCC patients 
and methods for preventing HCC in 
patients with chronic liver diseases. 
In addition, the invention provides 
methods for inhibiting metastasis 
in HCC patients by suppressing 
the function of one therapeutic 
target, osteopontin, and methods 
for preventing the development 
of HCC in patients with chronic 
liver diseases by suppressing 
the function of one therapeutic 
target, EpCAM. Pharmaceutical 
compositions containing agents 
capable of inhibiting the functions 
of osteopontin or EpCAM are also 
disclosed. 

Methods of Diagnosing Potential for
Metastasis or Developing Hepatocellular Carcinoma 
and of Identifying Therapeutic Targets 

CROSS-REFERENCES TO RELATED APPLICATIONS

[0001] This application claims the benefit of priority to U.S. Provisional Patent Application
No. 60/370,895, filed April 5, 2002, the entire contents of which are hereby incorporated by 
reference. 

STATEMENT AS TO RIGHTS TO INVENTIONS MADE UNDER
FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

[0002] This invention is owned by the United States of America as represented by the 
Secretary of Health and Human Services. 

BACKGROUND OF THE INVENTION

[0003] Hepatocellular carcinoma (HCC) is one of the most common and aggressive 
malignancies worldwide, with a cure rate of less than 5%. The high mortality is mainly
due to the occurrence of intra-hepatic metastases. Little is known about the molecular basis 
of intra-hepatic metastasis or about specific therapeutic targets in these patients. 

[0004] Within the past decade, several technologies have made it possible to monitor the
expression level of a large number of transcripts at any one time (see, e.g., Schena et al.,
Science 270:467-470, 1995; Lockhart et al., Nature Biotechnology 14:1675-1680, 1996;
Blanchard et al., Nature Biotechnology 14:1649, 1996; and U.S. Pat. No. 5,569,588). In
organisms for which the complete genome is known, it is possible to analyze the transcripts
of all genes within the cell. With other organisms, such as humans, for which there is an
increasing knowledge of the genome, it is possible to simultaneously monitor large numbers
of the genes within the cell. Such monitoring technologies have been applied to the
identification of genes which are up regulated or down regulated in various diseased or
physiological states, the analyses of members of signaling cellular states, and the
identification of targets for various drugs.

[0005] The present inventors analyzed the expression of 9,180 genes in HCC tissues from
40 patients without or with accompanying intra-hepatic metastases. Using a supervised
machine learning algorithm to classify patients based on their gene expression signatures, a
molecular signature has been generated for the first time that correctly classifies patients with
or without metastases and identifies the genes that are most relevant to the prediction of
outcome, including patient survival. The gene expression signature of primary HCCs with
accompanying metastasis is very similar to that of their corresponding metastases, suggesting
that the genes favoring metastasis progression likely have been initiated in the primary
tumors. Moreover, osteopontin (OPN) is overexpressed in primary HCC with intra-hepatic
metastasis, and a neutralizing antibody against osteopontin is shown to block invasion of
highly metastatic HCC cells in an in vitro assay of invasion. These data identify osteopontin
both as a diagnostic marker and a therapeutic target for metastatic HCC.

[0006] The expression of 9,180 genes has also been analyzed in tumor samples from 54
HCC patients and in 59 non-cancerous liver samples from patients with severe liver diseases
and at high risk for developing HCC or at low risk for developing HCC. The high risk group
includes patients diagnosed with hepatitis B, hepatitis C, hemochromatosis, and Wilson's
disease. The low risk group includes patients diagnosed with alcoholic liver disease,
autoimmune hepatitis, and primary biliary cirrhosis. A comparison of the gene expression
levels between the high risk and low risk groups has identified a set of significant genes that
would differentiate between the high risk and low risk groups. Filtering the set of significant
genes using expression data from HCC samples has identified subsets of genes enriched with
HCC-related molecular signatures and useful for classifying samples. In addition, EpCAM is
among the most significant genes whose overexpression positively correlates with the risk of
developing HCC in a patient with a severe liver disease, and the inhibition of its expression
has been shown to lead to growth suppression in HCC cells. Thus, EpCAM has been
identified as a diagnostic marker for predicting the risk of developing HCC as well as a
therapeutic target for preventing the onset of HCC in patients suffering from chronic liver
diseases.

BRIEF SUMMARY OF THE INVENTION 

[0007] One aspect of the present invention relates to a method for identifying potential
therapeutic targets for inhibiting metastasis in a patient suffering from HCC or for preventing
the development of HCC in a patient suffering from a chronic liver disease. 

[0008] The method for identifying potential therapeutic targets for inhibiting metastasis in
an HCC patient includes the steps of: a) contacting an array comprising capture reagents for a
set of cellular markers with a sample from a metastatic HCC patient; b) capturing markers
from the sample and generating a first signal; c) repeating steps a) and b) with a sample from
a non-metastatic HCC patient and thereby generating a second signal; and d) comparing the
first and second signals and thereby identifying a subset of cellular markers whose level is
different in the first and second signals, wherein the subset of cellular markers are potential
therapeutic targets for treating HCC metastasis in an HCC patient. In some embodiments, a
signal generated from a normal non-cancerous sample on an array identical to the array of
step a) is subtracted in steps b) and c) to generate the first and second signals.
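
The comparison of step d), together with the optional subtraction of a normal-sample signal, can be illustrated with a minimal sketch. The representation of a signal as a mapping from marker name to intensity, the function names, the marker label GENE_A, and the `min_difference` cutoff are all assumptions made for illustration; the specification does not prescribe a particular data format or threshold.

```python
from typing import Dict, List, Optional

# Hypothetical representation: marker name -> measured intensity on the array.
Signal = Dict[str, float]


def subtract_background(signal: Signal, normal: Optional[Signal]) -> Signal:
    """Subtract a normal non-cancerous signal, marker by marker, if one is given."""
    if normal is None:
        return dict(signal)
    return {marker: value - normal.get(marker, 0.0) for marker, value in signal.items()}


def differential_markers(
    first: Signal,
    second: Signal,
    normal: Optional[Signal] = None,
    min_difference: float = 1.0,  # assumed cutoff, not taken from the specification
) -> List[str]:
    """Return markers whose level differs between the first and second signals."""
    a = subtract_background(first, normal)
    b = subtract_background(second, normal)
    shared = sorted(set(a) & set(b))
    return [m for m in shared if abs(a[m] - b[m]) >= min_difference]


if __name__ == "__main__":
    metastatic = {"OPN": 8.0, "GENE_A": 3.0}
    non_metastatic = {"OPN": 2.0, "GENE_A": 3.2}
    normal_liver = {"OPN": 1.0, "GENE_A": 1.0}
    print(differential_markers(metastatic, non_metastatic, normal_liver))
```

Because the same normal signal is subtracted from both samples, the subtraction does not change the difference between them in this simplified form; it matters when the signals are compared as ratios or when the arrays differ in background.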

[0009] The method for identifying potential therapeutic targets for preventing the onset of
HCC in a patient with a chronic liver disease includes the steps of: a) contacting an array
comprising capture reagents for a set of cellular markers with a sample from a patient with a
chronic liver disease and a high risk of developing HCC; b) capturing markers from the
sample and generating a first signal; c) repeating steps a) and b) with a sample from a patient
with a chronic liver disease and a low risk of developing HCC and thereby generating a
second signal; and d) comparing the first and second signals and thereby identifying a subset
of cellular markers whose level is different in the first and second signals, wherein the subset
of cellular markers are potential therapeutic targets for preventing HCC in a patient with a
chronic liver disease. In some embodiments, a signal generated from a normal non-cancerous
sample on an array identical to the array of step a) is subtracted in steps b) and c) to generate
the first and second signals.

[0010] Another aspect of the present invention relates to a method for predicting the
metastatic potential in an HCC patient or for predicting the risk of developing HCC in a
patient with a chronic liver disease.

[0011] The method for predicting the metastatic potential in an HCC patient includes the
steps of: a) contacting an array comprising capture reagents for a set of cellular markers with
a sample from a metastatic HCC patient, the set of cellular markers comprising at least ten
genes or proteins encoded by genes independently selected from the genes of Table 2; b)
capturing markers from the sample; c) generating a first signal from the captured markers of
step b); d) repeating steps a) to c) with a sample from a non-metastatic HCC patient and
thereby generating a second signal; e) repeating steps a) to c) with a sample from an HCC
patient with unknown metastatic potential and thereby generating a third signal; and f)
comparing the third signal to the first and the second signals and thereby determining the
metastatic potential of the HCC patient of step e). In some embodiments, the set of cellular
markers includes at least 20, preferably 50, more preferably 100, and most preferably all
genes or proteins encoded by genes independently selected from the genes of Table 2. In
other embodiments, the set of cellular markers includes the genes or proteins encoded by
genes of Table 4 or Unigene numbers Hs.313, Hs.69707, Hs.222, Hs.63984, Hs.75573,
Hs.177687, Hs.69707, Hs.222, Hs.323712, and Hs.63984. Preferably, the sample of steps a)
and b), the sample of step d), and the sample of step e) are liver tissue extracts. In a preferred
embodiment, the array of step a) is a genomic array. In another preferred embodiment, the
array of step a) is a proteomic array.
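
Step f), in which the third signal is compared to the first and second signals, can be illustrated as follows. This sketch substitutes a simple Pearson correlation against each reference signal for the supervised machine learning algorithm referred to in the Background; it is an assumption chosen for brevity and not the classifier used by the inventors. The function names, output labels, and marker names are likewise hypothetical.

```python
import math
from typing import Dict, List

# Hypothetical representation: marker name -> measured intensity on the array.
Signal = Dict[str, float]


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def classify_third_signal(first: Signal, second: Signal, third: Signal) -> str:
    """Label the third signal by whichever reference signal it correlates with more."""
    markers = sorted(set(first) & set(second) & set(third))
    if len(markers) < 2:
        raise ValueError("at least two shared markers are needed for a correlation")
    unknown = [third[m] for m in markers]
    r_first = pearson([first[m] for m in markers], unknown)
    r_second = pearson([second[m] for m in markers], unknown)
    if r_first > r_second:
        return "metastatic-like"
    if r_second > r_first:
        return "non-metastatic-like"
    return "indeterminate"


if __name__ == "__main__":
    first = {"M1": 5.0, "M2": 1.0, "M3": 4.0}   # metastatic reference
    second = {"M1": 1.0, "M2": 5.0, "M3": 2.0}  # non-metastatic reference
    third = {"M1": 4.5, "M2": 1.5, "M3": 3.5}   # patient of unknown potential
    print(classify_third_signal(first, second, third))
```

In practice each reference would be built from many patients rather than a single sample, and the marker set would contain at least ten of the genes of Table 2.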

[0012] The method for predicting the risk of developing HCC in a patient suffering from a
chronic liver disease includes the steps of: a) contacting an array comprising capture reagents
for a set of cellular markers with a sample from a patient with a chronic liver disease and a
high risk of HCC, the set of cellular markers comprising at least ten genes or proteins
encoded by genes independently selected from the genes of Table 5; b) capturing markers
from the sample; c) generating a first signal from the captured markers of step b); d)
repeating steps a) to c) with a sample from a patient with a chronic liver disease and a low
risk of HCC and thereby generating a second signal; e) repeating steps a) to c) with a sample
from a patient with a chronic liver disease and an unknown risk of HCC and thereby
generating a third signal; and f) comparing the third signal to the first and the second signals
and thereby determining the risk of developing HCC in the patient of step e). In some
embodiments, the set of cellular markers comprises at least 20, preferably 50, more
preferably 100, and most preferably all genes or proteins encoded by genes independently
selected from the genes of Table 5. In some other embodiments, the set of cellular markers
comprises the genes or proteins encoded by genes of Table 6 or Table 7. Preferably, the
sample of steps a) and b), the sample of step d), and the sample of step e) are liver tissue
extracts. In one preferred embodiment, the array of step a) is a genomic array. In another
preferred embodiment, the array of step a) is a proteomic array. In some embodiments, the
patient with a high risk of developing HCC suffers from hepatitis B infection, hepatitis C,
hemochromatosis, or Wilson's disease. In other embodiments, the patient with a low risk of
HCC suffers from alcoholic liver disease, autoimmune hepatitis, or primary biliary cirrhosis.
In yet other embodiments, the patient whose risk of developing HCC is being assessed suffers
from hepatitis B, hepatitis C, hemochromatosis, Wilson's disease, alcoholic liver disease,
autoimmune hepatitis, or primary biliary cirrhosis.

[0013] Yet another aspect of the invention relates to a method for inhibiting metastasis in
an HCC patient as well as a method for inhibiting the development of HCC in a patient with a
chronic liver disease. The method for inhibiting HCC metastasis in an HCC patient includes
the step of suppressing OPN activity. In some embodiments, suppression of OPN activity is
accomplished by inhibiting OPN expression, preferably using an antisense polynucleotide
specific for OPN. In other embodiments, suppression of OPN activity is accomplished by
inhibiting the specific binding between OPN and OPN receptor, preferably using an
anti-OPN antibody. The method for preventing the onset of HCC in a patient with a chronic liver
disease includes the step of suppressing EpCAM activity. In some embodiments, suppression
of EpCAM activity is accomplished by inhibiting EpCAM expression, preferably using an
antisense polynucleotide or a small inhibitory RNA molecule specific for EpCAM. In other
embodiments, suppression of EpCAM activity is accomplished by inhibiting the specific
binding between EpCAM and EpCAM receptor, preferably using an anti-EpCAM antibody.

[0014] A still further aspect of the present invention relates to a computer readable
medium, a digital computer, and a system for assessing the metastatic potential in an HCC
patient or the risk of developing HCC in a patient with a chronic liver disease.

[0015] The computer readable medium for assessing the metastatic potential in an HCC
patient includes: a) code for a first data set, derived from a first signal from an array
comprising capture reagents for a set of cellular markers after contact with a sample from a
metastatic HCC patient, the set of cellular markers comprising at least 10 genes or proteins
encoded by genes independently selected from the genes of Table 2; b) code for a second data
set, derived from a second signal from an array identical to the array of a) after contact with a
sample from a non-metastatic HCC patient; c) code for a third data set, derived from a third
signal from an array identical to the array of a) after contact with a sample from an HCC
patient with unknown metastatic potential; and d) code for comparing the third data set with
the first and second data sets. A digital computer containing the claimed computer readable
medium for assessing HCC metastatic potential in an HCC patient is also provided. Further
provided is a system containing such a digital computer, a chip with an array comprising
capture reagents for a set of cellular markers comprising at least 10 genes or proteins encoded
by genes independently selected from the genes of Table 2, and a reader capable of
registering a signal from the array after contact with a sample.

[0016] The computer readable medium for assessing the risk of developing HCC in a
patient with a chronic liver disease includes: a) code for a first data set, derived from a first
signal from an array comprising capture reagents for a set of cellular markers after contact
with a sample from a patient with a chronic liver disease and a high risk of HCC, the set of
cellular markers comprising at least 10 genes or proteins encoded by genes independently
selected from the genes of Table 5; b) code for a second data set, derived from a second
signal from an array identical to the array of a) after contact with a sample from a patient with
a chronic liver disease and a low risk of HCC; c) code for a third data set, derived from a
third signal from an array identical to the array of a) after contact with a sample from a
patient with a chronic liver disease and an unknown risk of HCC; and d) code for comparing
the third data set with the first and second data sets. A digital computer containing the
claimed computer readable medium for assessing the risk of developing HCC in a patient with a
chronic liver disease is also provided. Further provided is a system containing such a digital
computer, a chip with an array comprising capture reagents for a set of cellular markers
comprising at least 10 genes or proteins encoded by genes independently selected from the
genes of Table 5, and a reader capable of registering a signal from the array after contact with
a sample.

DEFINITIONS 

[0017] Unless defined otherwise, all technical and scientific terms used herein have the
meaning commonly understood by a person skilled in the art to which this invention belongs.
The following references provide one of skill with a general definition of many of the terms
used in this invention: Singleton et al., Dictionary of Microbiology and Molecular Biology
(2nd ed. 1994); The Cambridge Dictionary of Science and Technology (Walker ed., 1988);
The Glossary of Genetics, 5th Ed., R. Rieger et al. (eds.), Springer Verlag (1991); and Hale &
Marham, The Harper Collins Dictionary of Biology (1991). As used herein, the following 
terms have the meanings ascribed to them unless specified otherwise. 

[0018] The term "hepatocellular carcinoma" or "HCC" as used herein refers to the major
type of carcinoma of the liver that accounts for more than 90% of all primary liver cancers.
Hepatocellular carcinomas range from well differentiated to highly anaplastic
undifferentiated lesions. Hepatocellular carcinomas may exist as single intra-hepatic lesions
(non-metastatic), multifocal intra-hepatic metastasis, or extra-hepatic metastasis.

[0019] "High risk precancerous diseases" refer to a group of epidemiologically defined
diseases that are associated with a high probability of developing HCC. These diseases
include chronic hepatitis B infection, hepatitis C infection, hemochromatosis, and Wilson's
disease.

[0020] "Low risk precancerous diseases" refer to a group of epidemiologically defined
diseases that are associated with a low risk of developing HCC. These diseases include
alcoholic liver disease, autoimmune hepatitis, and primary biliary cirrhosis. 

[0021] The term "metastasis" or "metastatic" refers to the ability of a cancer cell to invade
surrounding tissues, to enter the circulatory system and to establish malignant growths at new 
sites. 

[0022] "Non-Metastatic" refers to tumors that do not spread beyond their original site of 
development and specifically do not enter the circulatory system and establish malignant 
growths at new sites.

[0023] The term "non-cancerous" refers to a biological sample or tissue sample in which 
the cells in the sample exhibit a normal or non-pathological phenotype when analyzed 
visually, by microscope, immunohistologically, immunologically, or molecularly using 
antibody or nucleic acid probes designed to detect pathological conditions. 

[0024] The term "normal" refers to a biological sample or tissue sample that is
obtained from an individual who has not been diagnosed with HCC or with a high risk or
low risk precancerous disease.

[0025] The term "capture reagent" refers to any type of moiety that binds to a specific
nucleic acid or protein marker. Typically the binding of the marker to the capture reagent can
be controlled by the conditions used during the binding process. For example, the binding of
a nucleic acid marker to a cognate oligonucleotide is controlled by the hybridization
conditions used. Stringent hybridization conditions will only allow a nucleic acid marker
that has high homology, e.g., 95%-100% identity with the oligonucleotide, to bind to the
oligonucleotide.
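
The 95%-100% identity figure can be made concrete with a short calculation. The sketch below is a simplification made for illustration: it compares a marker against the sequence the oligonucleotide is designed to detect, assumes both are already aligned and of equal length, and ignores the thermodynamics of real hybridization. The function names and the example sequences are hypothetical.

```python
def percent_identity(marker: str, target: str) -> float:
    """Percentage of positions at which two aligned, equal-length sequences match."""
    if not marker or len(marker) != len(target):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(marker.upper(), target.upper()))
    return 100.0 * matches / len(marker)


def binds_under_stringent_conditions(
    marker: str, target: str, min_identity: float = 95.0
) -> bool:
    """Apply the lower bound of the 95%-100% identity range as a cutoff."""
    return percent_identity(marker, target) >= min_identity


if __name__ == "__main__":
    target = "ACGTACGTACGTACGTACGT"          # 20-base sequence the probe detects
    one_mismatch = "ACGTACGTACGTACGTACGA"    # 19 of 20 positions match
    two_mismatches = "TCGTACGTACGTACGTACGA"  # 18 of 20 positions match
    print(binds_under_stringent_conditions(one_mismatch, target))
    print(binds_under_stringent_conditions(two_mismatches, target))
```

For a 20-base oligonucleotide, a single mismatch leaves 95% identity and sits exactly at the lower bound of the stated range, while two mismatches fall below it.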

[0026] "Array" refers to a plurality of capture reagents bound to a substrate, e.g., a solid
support, which will bind to their cognate markers. For example, the array may be composed
of nucleic acid molecules, protein molecules or any other reagent that will specifically bind a
nucleic acid, protein or polypeptide isolated from a biological sample. The capture reagents
are preferentially bound in an addressable fashion such that when the cognate marker is
bound to the capture reagent, the amount of binding may be quantified.

[0027] "DNA microarray" refers to an array in which the capture reagents are nucleic acid
molecules. Typically, a DNA microarray is composed of DNA oligonucleotides of a defined
length which can hybridize to DNA, cDNA or RNA molecules under defined conditions.
DNA oligonucleotides may be short pieces of nucleic acid ranging in size from 15-50 bases
or they may be longer pieces of nucleic acids ranging in size from 500-1000 bases or longer.
DNA microarrays may be composed of hundreds or thousands of different nucleic acid
molecules each of which is located on the array in a defined position. Binding of the marker
to the DNA microarray is usually quantified when the marker is labeled with a detectable
moiety. The term DNA microarray is used interchangeably with the term "genomic array".

[0028] "Protein array" refers to an array in which the capture reagents will bind protein
markers. Typically these reagents may be polyclonal or monoclonal antibodies that bind
specific proteins. Alternatively, any protein, peptide, nucleic acid or other molecule or
surface which will specifically bind to a protein may be used in a protein array. These arrays
usually contain hundreds or thousands of different capture reagents in addressable locations.
Binding of the markers to the capture reagent on the protein array is usually quantified when
the marker is labeled with a detectable moiety. The term protein array is used
interchangeably with "proteomic array".

[0029] "Gene expression profile" refers to all of the genes that are expressed in a tissue
sample compared to a reference sample. The level of gene expression of genes in a gene
expression profile is determined by comparing the level of expression in a test sample, e.g., an
HCC tumor sample or a sample obtained from a patient diagnosed with severe liver disease, to
the level of expression in a reference sample. The reference sample used for determining the
metastatic potential of an HCC tumor is non-cancerous liver tissue or liver tissue obtained
from a patient who has not been diagnosed with HCC. The reference sample used for
determining the potential for developing HCC in patients diagnosed with severe liver disease
is liver tissue obtained from patients who have not been diagnosed with severe liver disease.
Genes in the test sample may be over expressed or under expressed relative to the reference
sample.
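
A gene expression profile of this kind is commonly summarized as a log ratio of test to reference expression for each gene. The following sketch shows one way to do so; the use of a base-2 logarithm, the two-fold cutoff, the function name, and the gene labels GENE_X and GENE_Y are assumptions made for illustration and are not specified in this document.

```python
import math
from typing import Dict, Tuple

# Hypothetical representation: gene name -> expression level (positive number).
Expression = Dict[str, float]


def expression_profile(
    test: Expression,
    reference: Expression,
    fold_threshold: float = 2.0,  # assumed cutoff for calling a gene over/under expressed
) -> Dict[str, Tuple[float, str]]:
    """Return, per shared gene, the log2(test/reference) ratio and a qualitative call."""
    cutoff = math.log2(fold_threshold)
    profile: Dict[str, Tuple[float, str]] = {}
    for gene in sorted(set(test) & set(reference)):
        if test[gene] <= 0 or reference[gene] <= 0:
            continue  # a log ratio is undefined for non-positive levels
        log_ratio = math.log2(test[gene] / reference[gene])
        if log_ratio >= cutoff:
            call = "over expressed"
        elif log_ratio <= -cutoff:
            call = "under expressed"
        else:
            call = "unchanged"
        profile[gene] = (log_ratio, call)
    return profile


if __name__ == "__main__":
    tumor = {"OPN": 8.0, "GENE_X": 1.0, "GENE_Y": 3.0}
    reference_liver = {"OPN": 2.0, "GENE_X": 4.0, "GENE_Y": 3.0}
    for gene, (ratio, call) in expression_profile(tumor, reference_liver).items():
        print(gene, ratio, call)
```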

[0030] "Metastatic gene expression predictor" refers to the expression of a specific cluster
of genes correlated with the diagnosis of metastatic HCC. The metastatic gene expression
predictor is generated by comparing the gene expression profile of a test sample obtained
from a non-metastatic HCC sample to the gene expression profile obtained from a metastatic
HCC sample followed by a cluster and classification analysis using a defined algorithm or set
of algorithms. The number of genes present may vary depending on the clustering algorithm
used or depending on a parameter in the algorithm, e.g., p-level = 0.001 vs. 0.022.
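
The dependence of the number of predictor genes on the p-level can be seen in a small sketch. It assumes that a p-value has already been computed for each gene by some statistical test that is not shown here; the gene names and p-values are invented for illustration, and the function name is hypothetical.

```python
from typing import Dict, List


def select_predictor_genes(p_values: Dict[str, float], p_level: float) -> List[str]:
    """Keep the genes whose p-value falls below the chosen p-level."""
    return sorted(gene for gene, p in p_values.items() if p < p_level)


if __name__ == "__main__":
    # Invented p-values, for illustration only.
    p_values = {"GENE_1": 0.0004, "GENE_2": 0.01, "GENE_3": 0.2}
    print(select_predictor_genes(p_values, 0.001))
    print(select_predictor_genes(p_values, 0.022))
```

A stricter p-level such as 0.001 retains fewer genes than a looser one such as 0.022, which is why the size of the predictor is not fixed.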

[0031] "HCC gene expression predictor" refers to the expression of a specific cluster of
genes correlated with the diagnosis of patients likely to develop HCC. The HCC gene
expression predictor is generated by comparing the gene expression profile of a test sample
obtained from a non-metastatic liver sample obtained from a patient with a high risk for
developing HCC to the gene expression profile obtained from a non-metastatic liver sample
obtained from a patient having a low risk of developing HCC followed by a cluster and
classification analysis using a defined algorithm or set of algorithms. The number of genes
present may vary depending on the clustering algorithm used or depending on a parameter in
the algorithm, e.g., p-level = 0.001 vs. 0.022.

[0032] "UG Cluster" used in Tables 2-7 refers to the UniGene data base compiled by the
National Center for Biotechnology Information ("NCBI"). Each accession number in the
UniGene data base is a compilation of all of the nucleotide and amino acid sequence data
available for a specific nucleotide sequence. For example, each UG Cluster accession
number may provide links to GenBank or other data bases which in turn provide nucleotide
sequences encoding a partial or full length cDNA for a gene. Alternatively the links may
provide genomic or EST sequence data or amino acid sequence information. Each UG
Cluster accession number provides unique sequence information for the specific gene, nucleic
acid or amino acid sequence identified.

[0033] "Osteopontin" refers to a secreted phosphoprotein encoded by SEQ ID NO: 1 or a
conservative variant thereof, which may also be found in GenBank accession number
NM_000582. Nucleic acid and amino acid sequence information may also be found in the
National Center for Biotechnology Information ("NCBI") UniGene data base under accession
number Hs.313 at NCBI web site. This site lists 9 mRNA/genomic DNA sequences and over
900 expressed sequence tags. Osteopontin is an extracellular protein associated with the bone
matrix and associated with atherosclerotic plaques. Full length osteopontin protein contains
an RGD amino acid sequence that functions as an integrin binding site. Osteopontin is a
major ligand for the vitronectin receptor. "OPN" is used interchangeably with osteopontin
and refers either to the protein, the gene encoding the protein or fragments thereof.

[0034] "EpCAM" is a 40 kDa glycoprotein that functions as an Epithelial Cell Adhesion
Molecule. It is also identified as tumor-associated calcium signal transducer or TACSTD1,
with a Unigene Cluster number of Hs.692. EpCAM is encoded by the GA733-2 gene, which
is located on human chromosome 4q. A transmembrane protein expressed in cells of
epithelial origin, EpCAM mediates Ca2+-independent homotypic cell-cell adhesion and is
specifically recognized by a number of well known monoclonal antibodies (mAb), such as
17-1A, 323/A3, KS1/4, GA733, MOC31, etc.

[0035] The term "Marker" in the context of the present invention refers to a nucleic acid
sequence or a gene encoding a polypeptide (of a particular apparent molecular weight) which
is differentially present in a sample taken from patients having metastatic HCC or a
predisposition for HCC as compared to a comparable sample taken from control subjects
(e.g., a person with non-metastatic HCC or a negative diagnosis or undetectable cancer,
normal or healthy subject). Marker may also refer to a polypeptide or protein encoded by a
nucleic acid sequence or gene which is differentially present in a sample taken from patients
having metastatic HCC or a predisposition for HCC as compared to a comparable sample
taken from control subjects (e.g., a person with non-metastatic HCC or a negative diagnosis
or undetectable cancer, normal or healthy subject). Markers of the present invention include
the genes and their encoded proteins identified by UG Cluster number in Tables 2-7 infra.

[0036] The term "sample" as used herein is a sample of biological tissue or fluid that will
be used to determine a gene expression profile, a source of markers, or that contains a protein
of interest (such as osteopontin or EpCAM) or a nucleic acid encoding such protein. Such
samples include, but are not limited to, various types of tissue isolated from humans, and may
also include sections of tissues such as frozen sections or paraffin sections taken for
histological purposes. Tissues include liver samples and fluid samples include blood, serum,
plasma, urine, and other bodily fluids. A preferred sample used for practicing the present
invention is a lysate of cells extracted from a tissue of interest, e.g., liver. Such a cell lysate
may be prepared using a variety of methods known to those skilled in the art, depending on
the form in which a cellular marker is to be detected and examined, e.g., as a nucleic acid
such as mRNA, as a protein, or as a molecule with other measurable biological characteristics
such as an enzymatic activity.

[0037] The phrase "functional effects" in the context of assays for testing compounds that 
regulate the biological activity of a protein of interest, e.g., osteopontin or EpCAM, includes 
the determination of any parameter that is directly or indirectly related to or under the 
influence of OPN or EpCAM, such as the level of mRNA encoding the proteins, the level of 
the proteins, as well as their functional, physical, and chemical effects (e.g., their ability to 
specifically interact with their natural binding partners, such as other proteins, nucleic
acids, or any other molecules, their ability to mediate signal transduction that may affect 
cellular events such as cell proliferation, differentiation, apoptosis, secretion, adhesion, and 
the like). 

[0038] "Nucleic acid" refers to deoxyribonucleotides or ribonucleotides and polymers 
thereof in either single- or double-stranded form. The term encompasses nucleic acids 
containing known nucleotide analogs or modified backbone residues or linkages, which are 
synthetic, naturally occurring, and non-naturally occurring, which have similar binding 
properties as the reference nucleic acid, and which are metabolized in a manner similar to the 
reference nucleotides. Examples of such analogs include, without limitation, 
phosphorothioates, phosphoramidates, methyl phosphonates, chiral-methyl phosphonates,
2-O-methyl ribonucleotides, and peptide-nucleic acids (PNAs). The term encompasses nucleic
acids isolated from biological samples and synthetic oligonucleotides. 

[0039] Unless otherwise indicated, a particular nucleic acid sequence also implicitly 
encompasses conservatively modified variants thereof (e.g., degenerate codon substitutions) 
and complementary sequences, as well as the sequence explicitly indicated. Specifically, 
degenerate codon substitutions may be achieved by generating sequences in which the third 
position of one or more selected (or all) codons is substituted with mixed-base and/or 
deoxyinosine residues (Batzer et al., Nucleic Acid Res. 19:5081, 1991; Ohtsuka et al., J. Biol.
Chem. 260:2605-2608, 1985; Rossolini et al., Mol. Cell. Probes 8:91-98, 1994). The term
nucleic acid is used interchangeably with gene, cDNA, mRNA, oligonucleotide, and 
polynucleotide. 

[0040] The terms "polypeptide," "peptide" and "protein" are used interchangeably herein to
refer to a polymer of amino acid residues. The terms apply to amino acid polymers in which
one or more amino acid residues are artificial chemical mimetics of a corresponding naturally
occurring amino acid, as well as to naturally occurring amino acid polymers and
non-naturally occurring amino acid polymers.

[0041] The term "amino acid" refers to naturally occurring and synthetic amino acids, as
well as amino acid analogs and amino acid mimetics that function in a manner similar to the
naturally occurring amino acids. Naturally occurring amino acids are those encoded by the
genetic code, as well as those amino acids that are later modified, e.g., hydroxyproline,
γ-carboxyglutamate, and O-phosphoserine. Amino acid analogs refer to compounds that have
the same basic chemical structure as a naturally occurring amino acid, i.e., an α carbon that is
bound to a hydrogen, a carboxyl group, an amino group, and an R group, e.g., homoserine,
norleucine, methionine sulfoxide, methionine methyl sulfonium. Such analogs have modified
R groups (e.g., norleucine) or modified peptide backbones, but retain the same basic chemical
structure as a naturally occurring amino acid. Amino acid mimetics refer to chemical
compounds that have a structure that is different from the general chemical structure of an
amino acid, but that function in a manner similar to a naturally occurring amino acid.

[0042] Amino acids may be referred to herein by either their commonly known three letter
symbols or by the one-letter symbols recommended by the IUPAC-IUB Biochemical 
Nomenclature Commission. Nucleotides, likewise, may be referred to by their commonly 
accepted single-letter codes. 

[0043] "Conservatively modified variants" applies to both amino acid and nucleic acid 
sequences. With respect to particular nucleic acid sequences, conservatively modified
variants refers to those nucleic acids which encode identical or essentially identical amino
acid sequences, or where the nucleic acid does not encode an amino acid sequence, to 
essentially identical sequences. Because of the degeneracy of the genetic code, a large 
number of functionally identical nucleic acids encode any given protein. For instance, the 
codons GCA, GCC, GCG and GCU all encode the amino acid alanine. Thus, at every
position where an alanine is specified by a codon, the codon can be altered to any of the 
corresponding codons described without altering the encoded polypeptide. Such nucleic acid 
variations are "silent variations," which are one species of conservatively modified 
variations. Every nucleic acid sequence herein which encodes a polypeptide also describes 
every possible silent variation of the nucleic acid. One of skill will recognize that each codon
in a nucleic acid (except AUG, which is ordinarily the only codon for methionine, and UGG,
which is ordinarily the only codon for tryptophan) can be modified to yield a functionally
identical molecule. Accordingly, each silent variation of a nucleic acid which encodes a
polypeptide is implicit in each described sequence. 
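
As an illustration of silent variation, the following Python sketch enumerates every codon sequence that encodes the same short peptide. The codon table is a deliberately partial, hand-written subset of the standard genetic code (alanine, methionine, and tryptophan only), and the function name `silent_variants` is hypothetical; neither is taken from the specification.

```python
from itertools import product

# Partial standard genetic code (RNA codons). Only three amino acids are
# listed to keep the sketch short; this subset is an illustrative assumption.
CODONS = {
    "A": ["GCA", "GCC", "GCG", "GCU"],  # alanine: four synonymous codons
    "M": ["AUG"],                        # methionine: a single codon
    "W": ["UGG"],                        # tryptophan: a single codon
}

def silent_variants(peptide):
    """Return every RNA sequence that encodes `peptide` (one-letter codes)."""
    choices = [CODONS[aa] for aa in peptide]
    return ["".join(combo) for combo in product(*choices)]

variants = silent_variants("MAW")
print(len(variants))  # 1 * 4 * 1 = 4 silent variants
```

Only the alanine position contributes alternatives here, which mirrors the statement that AUG and the tryptophan codon ordinarily have no synonymous codons.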

[0044] As to amino acid sequences, one of skill will recognize that individual substitutions, 
deletions or additions to a nucleic acid, peptide, polypeptide, or protein sequence which 
alters, adds or deletes a single amino acid or a small percentage of amino acids in the encoded
sequence is a "conservatively modified variant" where the alteration results in the substitution 
of an amino acid with a chemically similar amino acid. Conservative substitution tables 
providing functionally similar amino acids are well known in the art. Such conservatively 
modified variants are in addition to and do not exclude polymorphic variants, interspecies 
homologs, and alleles of the invention.

[0045] The following eight groups each contain amino acids that are conservative
substitutions for one another:

1) Alanine (A), Glycine (G);
2) Aspartic acid (D), Glutamic acid (E);
3) Asparagine (N), Glutamine (Q);
4) Arginine (R), Lysine (K);
5) Isoleucine (I), Leucine (L), Methionine (M), Valine (V);
6) Phenylalanine (F), Tyrosine (Y), Tryptophan (W);
7) Serine (S), Threonine (T); and
8) Cysteine (C), Methionine (M)

(see, e.g., Creighton, Proteins, 1984).
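
The eight groups above can be encoded directly as a lookup. The following Python sketch is one possible representation, not part of the specification; the names `GROUPS` and `is_conservative` are hypothetical. Note that methionine appears in two groups, so the check tests membership in any shared group.

```python
# The eight conservative-substitution groups, by one-letter amino acid code.
GROUPS = [
    set("AG"),    # 1) alanine, glycine
    set("DE"),    # 2) aspartic acid, glutamic acid
    set("NQ"),    # 3) asparagine, glutamine
    set("RK"),    # 4) arginine, lysine
    set("ILMV"),  # 5) isoleucine, leucine, methionine, valine
    set("FYW"),   # 6) phenylalanine, tyrosine, tryptophan
    set("ST"),    # 7) serine, threonine
    set("CM"),    # 8) cysteine, methionine
]

def is_conservative(a, b):
    """True if residues `a` and `b` appear together in at least one group."""
    return any(a in group and b in group for group in GROUPS)

print(is_conservative("D", "E"))  # both in group 2
print(is_conservative("A", "D"))  # no shared group
```

Under this encoding, methionine is a conservative substitution for both leucine (group 5) and cysteine (group 8), while cysteine and leucine are not conservative substitutions for each other.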

[0046] Macromolecular structures such as polypeptide structures can be described in terms 
of various levels of organization. For a general discussion of this organization, see, e.g.,
Alberts et al., Molecular Biology of the Cell (3rd ed., 1994) and Cantor and Schimmel,
Biophysical Chemistry Part I: The Conformation of Biological Macromolecules (1980).
"Primary structure" refers to the amino acid sequence of a particular peptide. "Secondary
structure" refers to locally ordered, three dimensional structures within a polypeptide. These
structures are commonly known as domains. Domains are portions of a polypeptide that
form a compact unit of the polypeptide and are typically 50 to 350 amino acids long. Typical
domains are made up of sections of lesser organization such as stretches of β-sheet and
α-helices. "Tertiary structure" refers to the complete three dimensional structure of a
polypeptide monomer. "Quaternary structure" refers to the three dimensional structure
formed by the noncovalent association of independent tertiary units. Anisotropic terms are
also known as energy terms. 

[0047] "Antibody" refers to a polypeptide comprising a framework region from an
immunoglobulin gene or fragments thereof that specifically binds and recognizes an antigen.
The recognized immunoglobulin genes include the kappa, lambda, alpha, gamma, delta,
epsilon, and mu constant region genes, as well as the myriad immunoglobulin variable region
genes. Light chains are classified as either kappa or lambda. Heavy chains are classified as 
gamma, mu, alpha, delta, or epsilon, which in turn define the immunoglobulin classes, IgG, 
IgM, IgA, IgD and IgE, respectively. 

[0048] An exemplary immunoglobulin (antibody) structural unit comprises a tetramer.
Each tetramer is composed of two identical pairs of polypeptide chains, each pair having one
"light" (about 25 kDa) and one "heavy" chain (about 50-70 kDa). The N-terminus of each
chain defines a variable region of about 100 to 110 or more amino acids primarily responsible
for antigen recognition. The terms variable light chain (VL) and variable heavy chain (VH)
refer to these light and heavy chains respectively.

[0049] Antibodies exist, e.g., as intact immunoglobulins or as a number of well-characterized
fragments produced by digestion with various peptidases. Thus, for example,
pepsin digests an antibody below the disulfide linkages in the hinge region to produce F(ab)'2,
a dimer of Fab which itself is a light chain joined to VH-CH1 by a disulfide bond. The F(ab)'2
may be reduced under mild conditions to break the disulfide linkage in the hinge region,
thereby converting the F(ab)'2 dimer into an Fab' monomer. The Fab' monomer is
essentially Fab with part of the hinge region (see Fundamental Immunology (Paul ed., 3d ed.
1993)). While various antibody fragments are defined in terms of the digestion of an intact
antibody, one of skill will appreciate that such fragments may be synthesized de novo either
chemically or by using recombinant DNA methodology. Thus, the term antibody, as used
herein, also includes antibody fragments either produced by the modification of whole
antibodies, or those synthesized de novo using recombinant DNA methodologies (e.g., single
chain Fv) or those identified using phage display libraries (see, e.g., McCafferty et al., Nature
348:552-554, 1990).

[0050] For preparation of monoclonal or polyclonal antibodies, any technique known in the
art can be used (see, e.g., Kohler & Milstein, Nature 256:495-497 (1975); Kozbor et al.,
Immunology Today 4:72 (1983); Cole et al., pp. 77-96 in Monoclonal Antibodies and Cancer
Therapy (1985)). Techniques for the production of single chain antibodies (U.S. Patent
4,946,778) can be adapted to produce antibodies to polypeptides of this invention. Also,
transgenic mice, or other organisms such as other mammals, may be used to express
humanized antibodies. Alternatively, phage display technology can be used to identify
antibodies and heteromeric Fab fragments that specifically bind to selected antigens (see, e.g.,
McCafferty et al., supra; Marks et al., Biotechnology 10:779-783, 1992).

[0051] A "chimeric antibody" is an antibody molecule in which (a) the constant region, or a
portion thereof, is altered, replaced or exchanged so that the antigen binding site (variable
region) is linked to a constant region of a different or altered class, effector function and/or
species, or an entirely different molecule which confers new properties to the chimeric
antibody, e.g., an enzyme, toxin, hormone, growth factor, drug, etc.; or (b) the variable 
region, or a portion thereof, is altered, replaced or exchanged with a variable region having a 
different or altered antigen specificity. 

[0052] An "anti-OPN antibody" is an antibody or antibody fragment that specifically binds 
a polypeptide encoded by the OPN gene, cDNA, or a subsequence thereof. An anti-EpCAM
antibody is defined in a similar fashion. 

[0053] A "receptor" as used herein encompasses any molecule that a particular protein, e.g.,
OPN or EpCAM, can specifically bind and may thus include proteins, nucleic acids, 
carbohydrates, or any other molecules. 

[0054] The term "immunoassay" refers to an assay that uses an antibody to specifically bind an
antigen. The immunoassay is characterized by the use of specific binding properties of a 
particular antibody to isolate, target, and/or quantify the antigen. 

[0055] The phrase "specifically (or selectively) binds" to an antibody or "specifically (or 
selectively) immunoreactive with," when referring to a protein or peptide, refers to a binding
reaction that is determinative of the presence of the protein in a heterogeneous population of
proteins and other biologics. Thus, under designated immunoassay conditions, the specified
antibodies bind to a particular protein at least two times the background and do not
substantially bind in a significant amount to other proteins present in the sample. Specific
binding to an antibody under such conditions may require an antibody that is selected for its
specificity for a particular protein. For example, polyclonal antibodies raised to OPN from
specific species such as rat, murine, or human can be selected to obtain only those polyclonal
antibodies that are specifically immunoreactive with OPN and not with other proteins, except
for polymorphic variants and alleles of OPN. This selection may be achieved by subtracting
out antibodies that cross-react with OPN molecules from other species. A variety of
immunoassay formats may be used to select antibodies specifically immunoreactive with a
particular protein. For example, solid-phase ELISA immunoassays are routinely used to
select antibodies specifically immunoreactive with a protein (see, e.g., Harlow & Lane,
Antibodies, A Laboratory Manual, 1988, for a description of immunoassay formats and
conditions that can be used to determine specific immunoreactivity). Typically a specific or
selective reaction will be at least twice background signal or noise and more typically more 
than 10 to 100 times background. 

[0056] The phrase "differentially present" refers to differences in the quantity and/or the
frequency of a marker present in a sample taken from a metastatic HCC tumor or liver
samples of a patient at high risk for HCC as compared to a non-metastatic HCC sample or a
liver sample from a patient at low risk for HCC respectively. For example, a marker can be
a polypeptide or nucleic acid which is present at an elevated level or at a decreased level in
samples of metastatic HCC tumors or liver samples of someone at high risk for HCC
compared to non-metastatic HCC samples or a liver sample from a patient at low risk for
HCC respectively. Alternatively, a marker can be a polypeptide which is detected at a higher
frequency or at a lower frequency in metastatic HCC tumors or liver samples of someone at
high risk for HCC compared to non-metastatic HCC sample or a liver sample from a patient
at low risk for HCC respectively. A marker can be differentially present in terms of quantity,
frequency or both. 

[0057] A polypeptide or nucleic acid is differentially present between the two samples if 
the amount of the polypeptide in one sample is statistically significantly different from the 
amount of the polypeptide in the other sample. For example, a polypeptide is differentially 
present between the two samples if it is present at least about 120%, at least about 130%, at
least about 150%, at least about 180%, at least about 200%, at least about 300%, at least 
about 500%, at least about 700%, at least about 900%, or at least about 1000% greater than it 
is present in the other sample, or if it is detectable in one sample and not detectable in the 
other. 
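
The quantitative criterion above can be sketched as a simple comparison of two measured amounts. The Python example below is illustrative only: the function name `differentially_present`, the `detection_limit` parameter, and the reading of "at least about 120%" as a ratio of at least 1.2 between the larger and smaller amounts are all assumptions, and the sketch does not perform the statistical significance test that the paragraph also requires.

```python
def differentially_present(amount_a, amount_b, min_percent=120.0,
                           detection_limit=0.0):
    """Hypothetical check of the quantity criterion for two samples.

    Returns True if the marker is detectable in exactly one sample, or if
    the larger amount is at least `min_percent` percent of the smaller one.
    """
    a_detected = amount_a > detection_limit
    b_detected = amount_b > detection_limit
    if a_detected != b_detected:
        return True   # detectable in one sample and not in the other
    if not a_detected:
        return False  # not detectable in either sample
    larger, smaller = max(amount_a, amount_b), min(amount_a, amount_b)
    return larger / smaller * 100.0 >= min_percent

print(differentially_present(3.0, 1.0, min_percent=200.0))  # 300% of the other
print(differentially_present(1.1, 1.0))                     # only 110%
```

The threshold is a parameter so that any of the recited levels (120% through 1000%) can be applied to the same pair of measurements.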

[0058] Alternatively or additionally, a polypeptide is differentially present between the two
sets of samples if the frequency of detecting the polypeptide in the metastatic HCC tumors or
liver samples of someone at high risk for HCC is statistically significantly higher or lower
than in non-metastatic HCC samples or a liver sample from a patient at low risk for HCC
respectively. For example, a polypeptide is differentially present between the two sets of
samples if it is detected at least about 120%, at least about 130%, at least about 150%, at least
about 180%, at least about 200%, at least about 300%, at least about 500%, at least about
700%, at least about 900%, or at least about 1000% more frequently or less frequently
in one set of samples than in the other set of samples.

[0059] "Diagnostic" means identifying the presence or nature of a pathologic condition or
a predisposition for a pathologic condition such as HCC or HCC metastasis. Diagnostic
methods differ in their sensitivity and specificity. The "sensitivity" of a diagnostic assay is
the percentage of diseased individuals who test positive (percent of "true positives").
Diseased individuals not detected by the assay are "false negatives." Subjects who are not
diseased and who test negative in the assay, are termed "true negatives." The "specificity" of
a diagnostic assay is 1 minus the false positive rate, where the "false positive" rate is defined
as the proportion of those without the disease who test positive. While a particular diagnostic
method may not provide a definitive diagnosis of a condition, it suffices if the method
provides a positive indication that aids in diagnosis. 
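
The definitions of sensitivity and specificity above translate directly into arithmetic on the four outcome counts. The following Python sketch uses made-up counts purely for illustration; the function names are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of diseased individuals who test positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """One minus the false positive rate, as defined in the text."""
    false_positive_rate = false_pos / (false_pos + true_neg)
    return 1.0 - false_positive_rate

# Illustrative counts only (not data from the specification):
# 50 diseased subjects (45 detected), 100 non-diseased subjects (80 negative).
sens = sensitivity(true_pos=45, false_neg=5)
spec = specificity(true_neg=80, false_pos=20)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With these counts the assay detects 45 of 50 diseased subjects and wrongly flags 20 of 100 non-diseased subjects, so the two measures can differ substantially for the same assay.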

[0060] A "test amount" of a marker refers to an amount of a marker present in a sample
being tested. A test amount can be either in absolute amount (e.g., µg/ml) or a relative
amount (e.g., relative intensity of signals). 

[0061] A "diagnostic amount" of a marker refers to an amount of a marker in a subject's
sample that is consistent with a diagnosis of metastatic HCC tumors or tissue samples of 
someone at high risk for HCC. A diagnostic amount can be either in absolute amount (e.g., 
µg/ml) or a relative amount (e.g., relative intensity of signals).

[0062] A "control amount" of a marker can be any amount or a range of amounts which is to
be compared against a test amount of a marker. For example, a control amount of a marker
can be the amount of a marker in a person without metastatic HCC tumors or tissue samples
of someone at low risk for HCC. A control amount can be either in absolute amount (e.g.,
µg/ml) or a relative amount (e.g., relative intensity of signals).

[0063] "Spectrometer probe" refers to a device that is removably insertable into a gas 
phase ion spectrometer and comprises a substrate having a surface for presenting a marker for
detection. A spectrometer probe can comprise a single substrate or a plurality of substrates.
Terms such as ProteinChip, ProteinChip array, or chip are also used herein to refer to
specific kinds of spectrometer probes. 

[0064] "Substrate" or "probe substrate" refers to a solid phase onto which an adsorbent can
be provided (e.g., by attachment, deposition, etc.). 

[0065] "Adsorbent" refers to any material capable of adsorbing a marker. The term
"adsorbent" is used herein to refer both to a single material ("monoplex adsorbent") (e.g., a
compound or functional group) to which the marker is exposed, and to a plurality of different 
materials ("multiplex adsorbent") to which the marker is exposed. The adsorbent materials in 
a multiplex adsorbent are referred to as "adsorbent species." For example, an addressable 
location on a probe substrate can comprise a multiplex adsorbent characterized by many
different adsorbent species (e.g., anion exchange materials, metal chelators, or antibodies), 
having different binding characteristics. Substrate material itself can also contribute to 
adsorbing a marker and may be considered part of an "adsorbent." 

[0066] "Adsorption" or "retention" refers to the detectable binding between an adsorbent
and a marker either before or after washing with an eluant (selectivity threshold modifier) or
a washing solution. 

[0067] "Eluant" or "washing solution" refers to an agent that can be used to mediate
adsorption of a marker to an adsorbent. Eluants and washing solutions are also referred to as 
"selectivity threshold modifiers." Eluants and washing solutions can be used to wash and 
remove unbound materials from the probe substrate surface.

[0068] "Resolve," "resolution," or "resolution of marker" refers to the detection of at least 
one marker in a sample. Resolution includes the detection of a plurality of markers in a 
sample by separation and subsequent differential detection. Resolution does not require the 
complete separation of one or more markers from all other biomolecules in a mixture. 
Rather, any separation that allows the distinction between at least one marker and other
biomolecules suffices. 

[0069] "Gas phase ion spectrometer" refers to an apparatus that measures a parameter 
which can be translated into mass-to-charge ratios of ions formed when a sample is 
volatilized and ionized. Generally ions of interest bear a single charge, and mass-to-charge 
ratios are often simply referred to as mass. Gas phase ion spectrometers include, for
example, mass spectrometers, ion mobility spectrometers, and total ion current measuring
devices.

[0070] "Mass spectrometer" refers to a gas phase ion spectrometer that includes an inlet
system, an ionization source, an ion optic assembly, a mass analyzer, and a detector. 

[0071] "Laser desorption mass spectrometer" refers to a mass spectrometer which uses
a laser as a means to desorb, volatilize, and ionize an analyte.

[0072] "Detect" refers to identifying the presence, absence, or amount of the object to be
detected. 

[0073] "Detectable moiety" or a "label" refers to a composition detectable by
spectroscopic, photochemical, biochemical, immunochemical, or chemical means. For
example, useful labels include 32P, 35S, fluorescent dyes, electron-dense reagents, enzymes
(such as those commonly used in an ELISA, e.g., horseradish peroxidase), biotin-streptavidin,
digoxigenin, haptens and proteins for which antisera or monoclonal antibodies
are available, or nucleic acid molecules with a sequence complementary to a target. The
detectable moiety often generates a measurable signal, such as a radioactive, chromogenic, or
fluorescent signal, that can be used to quantify the amount of bound detectable moiety in a
sample. Quantitation of the signal is achieved by, e.g., scintillation counting, densitometry,
or flow cytometry. 

[0074] The term "activity" as used in the application refers to the biological functions of a 
molecule, such as a protein encoded by a gene of interest, e.g., osteopontin or EpCAM. This 
term encompasses biological functions such as enzymatic activity, specific interaction with
other molecules, regulatory effects on biological events at molecular or cellular level, and the
like. 

[0075] The term "inhibiting" or "inhibition" as used herein refers to a negative regulatory 
effect on the function or activity of an intended target molecule, such that the function or 
activity, e.g., enzymatic activity or specific interaction with other molecules, is detectably 
diminished or effectively abolished.

[0076] The term "antagonist" as used herein refers to a compound that is capable of 
negatively regulating the biological activity of a target molecule, e.g., osteopontin or 
EpCAM. An antagonist may effectuate the negative regulation by various means, such as by 
suppression of the expression of the target gene at transcriptional or translational level, or by 
interfering with the target molecule in its specific interaction with other molecules.

[0077] The term "antisense" as used in the context of describing a polynucleotide, refers to
a single-stranded nucleic acid having a nucleotide sequence complementary to at least a
portion of a target nucleic acid that encodes a protein of interest (e.g., osteopontin, or
EpCAM), or the "sense" sequence. Complementarity between two single-stranded
polynucleotides is based on the "A-T, G-C" base-pairing rule. For example, the sequence
"5'-AGAT-3'" is complementary to the sequence "5'-ATCT-3'". Complementarity between a
target nucleic acid and its antisense polynucleotide is typically 100%, i.e., all bases of the
antisense polynucleotide match the bases of the target nucleic acid, but may be of
varying degrees, i.e., there may be some mis-matched bases. The degree of
complementarity between a target nucleic acid and its antisense polynucleotide has
significant effects on the efficiency and strength of hybridization. An "antisense" 
polynucleotide sequence in the present application may correspond to a coding portion (i.e., 
exon) or a non-coding portion (i.e., intron) of the target nucleic acid. 
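
The base-pairing rule and the notion of partial complementarity can be illustrated with a short Python sketch. The function names `antisense` and `percent_complementarity` are hypothetical, the sketch handles DNA letters only, and it compares equal-length sequences position by position, which is a simplifying assumption rather than a model of hybridization.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def antisense(seq):
    """Return the reverse complement of `seq`, both written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def percent_complementarity(target, candidate):
    """Percent of positions at which `candidate` pairs with `target`.

    Both sequences are written 5' to 3' and must have equal length.
    """
    if len(target) != len(candidate):
        raise ValueError("sequences must have equal length")
    perfect = antisense(target)
    matches = sum(x == y for x, y in zip(perfect, candidate))
    return 100.0 * matches / len(target)

print(antisense("AGAT"))                        # the example from the text
print(percent_complementarity("AGAT", "ATCA"))  # one mis-matched base
```

Applied to the example in the text, `antisense("AGAT")` yields `ATCT`; a candidate with one mis-matched base out of four scores 75% under this position-wise measure.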

BRIEF DESCRIPTION OF THE DRAWINGS 
[0078] Figure 1. Classification of hepatocellular carcinoma with or without metastasis by
gene expression. A) Multidimensional scaling analysis of 50 primary and metastatic HCC
samples using 143 significant genes (p<0.0005) from supervised class comparison analysis of
all 5 clinical groups, i.e., P, P-M, PT, PT-M, PN. The axes represent the first three principal
components of these genes. P, primary HCC with intra-hepatic spreads; P-M, metastatic
lesion of P; PT, primary HCC with tumor thrombus in portal vein; PN, metastasis-free
primary HCC samples. B) Hierarchical clustering of 30 primary HCC samples from P, PT,
and PN groups using 383 significant genes (p<0.0005) derived from supervised class 
comparison. 

[0079] Figure 2. Prediction of metastasis and survival with metastasis predictor model 
derived from "leave-one-out" cross-validated compound covariate predictor classification. A)
Metastasis predictor model used in 40 training and testing HCC patients. The predictor was
based on a training set (circle) including 10 PN and 10 PT primary HCC samples that were
previously used in the compound covariate predictor classification and 20 primary blinded
HCC samples that were not used in the training procedure. The predictor uses 153 significant
genes that distinguish between these two groups. B) Multidimensional scaling analysis of 40
primary HCC samples using 153 significant genes from the predictor. Patient IDs are
indicated. C) Kaplan-Meier survival curves for 40 PN, PT and P patients. Cross marks
indicate time of censorship. 

[0080] Figure 3. Candidate genes associated with metastatic HCC. A) Hierarchical 
clustering of top 30 candidate genes whose expressions were altered largely in PT and PT-M, 
but rarely in PN. Each row represents an individual gene and each column represents an
individual tumor sample. Genes were ordered by centered correlation and complete linkage
according to the ratio of its abundance to the median abundance of all genes among all tumor
samples. Pseudo colors indicate differential expression: green squares, transcript levels
below the median; black squares, transcript levels equal to the median; red squares, transcript
levels greater than the median; gray squares, missing data. Dendrogram was based on 10
primary PN (green) and 10 primary PT (red) samples. B) Relative expression ratio of OPN 
by cDNA microarray analysis in 10 primary PN samples (green bars) and 10 primary PT 
samples (red bars) with accompanying metastasis (black bars). C and D) Semi-quantitative 
RT-PCR analysis of OPN mRNA level in primary HCC samples with or without metastasis. 

[0081] Figure 4. Immunohistochemical analysis of osteopontin in normal liver and
hepatocellular carcinoma. Primary tumor cells (tumor S30) show cytoplasmic osteopontin
immunoreactivity, especially in the area with high density of vasculature (panels b and d), but 
fibrous septa region (panels b and d) or normal liver parenchyma cells show no reactivity 
(panels a and c; normal liver 914). Magnification, x50. (H&E, x50). 

[0082] Figure 5. Role of osteopontin in promoting HCC metastasis. A) The level of
osteopontin of CCL13, SK-Hep-1, and Hep3B cells was determined by Western blotting with
a rat monoclonal anti-OPN antibody. A monoclonal β-actin antibody was used as internal
control. Densitometry was used to quantify the amount of OPN, which was normalized to
actin. OPN level is indicated as relative folds. B) CCL13, SK-Hep-1 or Hep3B cells were
incubated with or without a murine recombinant osteopontin protein or a neutralizing
antibody against osteopontin and their invasiveness was determined by the Matrigel
Basement Membrane Cell Invasion Chamber. Data are an average of triplicate determinations
for each condition and are expressed as the mean percent invasion (plus one standard
deviation) through the Matrigel Matrix and membrane (matrigel chamber) relative to the
migration through the control membrane (control chamber). C) The invasiveness of five
additional HCC cell lines (SMMC7721, MHCC97, HuH1, HuH4 and HuH7) through
matrigel matrix in response to osteopontin neutralizing antibody was determined as above.
D) Representative lung tissue sections (H&E stain; magnification x100) from mice at 35 days
following s.c. injection of HCCLM3 cells without (upper panel) or with (bottom panel)
anti-OPN neutralizing antibody are shown. Arrows indicate the tumor grades. E) Primary tumor
size was monitored at various weeks following s.c. injection of HCCLM3 cells into nude
mice. Data are an average of 10 mice. F) The formation of pulmonary metastases in nude
mice was determined at 35 days following s.c. injection of HCCLM3 cells with or without 
anti-OPN antibody. The number of metastatic foci was quantified based on their grades. 
Data are an average of 10 mice per group. The groups with significant p values (<0.05) are 
indicated by the asterisk. 

[0083] Figure 6. Potential oncogenic role of EpCAM in HCC development. a) and b) The
expression level of EpCAM in various chronic liver disease (CLD) liver samples as analyzed
by microarray (a) or RT-PCR (b). c) EpCAM expression in cells from normal human
fibroblasts (NHF-hTERT), normal liver (CCL13) and hepatoma (SK-Hep-1, Hep3B, Huh1,
Huh4, Huh7, and HepG2) was analyzed by western blotting with a monoclonal antibody
against EpCAM. A monoclonal antibody against beta-actin was used as an internal control.
d) Cell proliferation of Hep3B, Huh1, and Huh4 cells was determined by MTT assay and data
were an average of 3 independent experiments. e) Effective silencing of EpCAM expression
by siRNA was determined by western blotting analysis. f) Growth inhibition of Hep3B cells
by EpCAM siRNA as determined by MTT assay.

DETAILED DESCRIPTION OF THE INVENTION 
[0084] Hepatocellular carcinoma (HCC) is one of the most common and aggressive 
malignant tumors in the world, with high prevalence especially in Asia and Africa, and 
relatively low prevalence in Europe and North America (Parkin et al., CA Cancer J. Clin.
49:33-64, 1999; Pisani et al., Int. J. Cancer 83:18-29, 1999). Recent studies indicate that the
incidence of HCC in the U.S. and in the U.K. has significantly increased over the last two
decades (Taylor-Robinson et al., Lancet 350:1142-1143, 1997; El-Serag and Mason, N. Eng.
J. Med. 340:745-750, 1999). Most HCC patients are incurable due to their poor
prognosis. Although routine screening of individuals who are at risk for developing HCC
may provide some patients with an opportunity for an extended life, many patients are still
diagnosed with advanced HCC with little improvement in survival (see, e.g., Yang et al., J. Cancer
Res. Clin. Oncol. 123:357-360, 1997; Izzo et al., Ann. Surg. 227:513-518, 1998). While a
small subset of HCC patients qualifies for surgical intervention, the improvement in long-term
survival is only modest. The extremely poor prognosis of HCC is largely because of a
high rate of recurrence after surgery, or intra-hepatic metastases that develop by invasion of
the portal vein or spreading to other parts of the liver, whereas extrahepatic metastases are
less common (see, e.g., Genda et al., Hepatology 30:1027-1036, 1999). These data indicate
that the liver is the main target organ of HCC metastasis. It has been demonstrated in animal
model systems as well as in patients that the portal vein is the main route for intrahepatic
metastases of metastatic HCC cells (see, e.g., Mitsunobu et al., Clin. Exp. Metastasis
14:520-529, 1996). This specific feature of HCC underscores the need to develop an accurate
molecular profiling model for better diagnosis and therapeutic targets for the treatment of 
HCC patients with intrahepatic metastases. 

[0085] Current studies have largely been focused on individual candidate genes (see, e.g.,
Osada et al., Hepatology 24:1460-1467, 1996; Guo et al., Hepatology 28:1481-1488, 1998; 
Hui et al., Int. J. Cancer 84:604-608, 1999), which may be insufficient to reflect the precise 
biological nature of metastatic HCC. The microarray technology has offered an opportunity 
to probe disease-related gene expressions at a global genome scale (see, e.g., Schena et al., 
Science 270:467-470, 1995). This approach has allowed the successful molecular 
classification of several human malignant tumors with regard to their stage, prognostic
outcome, or response to therapy (Alizadeh et al., Nature 403:503-511, 2000; Bittner et al.,
Nature 406:536-540, 2000; Perou et al., Nature 406:747-752, 2000; Khan et al., Nat. Med.
7:613-619, 2001; Pomeroy et al., Nature 415:436-442, 2002; Shipp et al., Nat. Med. 8:68-74,
2002). A few reports have dealt with the gene expression profiles of primary HCC samples
(Okabe et al., Cancer Res. 61:2129-2137, 2001; Xu et al., Proc. Natl. Acad. Sci. U.S.A.
98:15089-15094, 2001). However, little is known about the molecular signatures associated
with a poor prognostic feature of patients with metastatic HCC. 

[0086] Using cDNA microarray-based gene expression profiling, the global changes 
associated with metastasis are investigated. The initial goal was to identify genes that can 
discriminate primary tumors from their matched intra-hepatic metastatic lesions. It is 
revealed that intrahepatic metastatic lesions are indistinguishable from their primary tumors, 
regardless of tumor size, encapsulation, and patient's age, whereas primary metastasis-free 
HCC is distinct from primary HCC with metastasis. These data indicate that changes 
favoring intrahepatic metastasis are initiated in the primary HCC. Moreover, an important 
gene, osteopontin, a secreted phosphoprotein, emerges in HCC metastasis. Osteopontin 
overexpression correlated with primary HCC with metastatic potential and invasiveness of
liver tumor-derived cell lines in vitro, and an osteopontin-neutralizing antibody efficiently
blocked in vitro invasion and in vivo pulmonary metastasis of HCC cells. These studies 
identify osteopontin both as a molecular marker for defining HCC patients with metastatic 
potential and as a potential therapeutic target for treating metastatic HCC. 

[0087] A similar approach is used to develop a gene expression prediction model for the
potential to develop HCC in patients with chronic liver diseases. By comparing the gene
expression profiles of patients epidemiologically at high risk for developing HCC with the
gene expression profiles of patients epidemiologically at low risk for developing HCC,
cellular markers are identified so as to allow the identification of individuals with chronic
liver diseases at high risk for developing HCC. The patients with severe liver diseases
include those diagnosed with chronic hepatitis B infection, hepatitis C infection,
hemochromatosis, Wilson's disease, alcoholic liver disease, autoimmune hepatitis, and
primary biliary cirrhosis. High risk precancerous diseases include chronic hepatitis B
infection, hepatitis C infection, hemochromatosis, and Wilson's disease. Low risk
precancerous diseases include alcoholic liver disease, autoimmune hepatitis, and primary
biliary cirrhosis. One gene identified to be associated with elevated risk of developing HCC
in patients with severe liver diseases is EpCAM. Growth suppression of liver cancer cells has
been observed upon inhibition of EpCAM expression, identifying its important role in HCC
development and as a therapeutic target for preventing HCC in patients with chronic liver
diseases.

[0088] One particular aspect of the invention provides methods for clustering co-regulated 
genes in patients suspected of having metastatic HCC or the potential to develop HCC into 
gene expression profiles. This section provides a more detailed discussion of methods for 
clustering co-regulated genes. 

I. DNA MICROARRAY ANALYSIS

A. Gene Expression Profile Classification by Cluster Analysis
[0089] For many applications of the present invention, it is desirable to find basis gene
expression profiles that are co-regulated in the non-metastatic HCC samples, the metastatic
HCC samples, the high risk for developing HCC samples and the low risk for developing
HCC samples. A preferred embodiment for identifying such basis gene expression profiles
involves clustering algorithms (for reviews of clustering algorithms, see, e.g., Fukunaga,
1990, Statistical Pattern Recognition, 2nd Ed., Academic Press, San Diego; Everitt, 1974,
Cluster Analysis, London: Heinemann Educ. Books; Hartigan, 1975, Clustering Algorithms,
New York: Wiley; Sneath and Sokal, 1973, Numerical Taxonomy, Freeman; Anderberg,
1973, Cluster Analysis for Applications, Academic Press: New York).

[0090] In some embodiments employing cluster analysis, the expression of a large number
of genes is monitored in biological samples obtained from different sources. A table of data
containing the gene expression measurements is used for cluster analysis. Cluster analysis
operates on a table of data which has the dimension m x k, wherein m is the total number of
conditions or perturbations and k is the number of genes measured.

[0091] A number of clustering algorithms are useful for clustering analysis. Clustering 
algorithms use dissimilarities or distances between objects when forming clusters. In some 
embodiments, the distance used is Euclidean distance in multidimensional space. The 
Euclidean distance may be squared to place progressively greater weight on objects that are 
further apart. Alternatively, the distance measure may be the Manhattan distance. In other
embodiments, unsupervised hierarchical clustering of a table of data may be performed using
the CLUSTER and TREEVIEW software (Eisen et al., Proc. Natl. Acad. Sci. U.S.A. 95:14863-14868,
1998) with median-centered correlation and complete linkage.
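
By way of illustration only, the distance measures named above may be sketched as follows. This is a minimal sketch and not the CLUSTER implementation; the function names and the sample values are hypothetical, and "median-centered correlation" is read here, as an assumption, as Pearson correlation computed after subtracting each gene's median.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    """Squared Euclidean distance; weights distant objects more heavily."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def median_center(values):
    """Subtract the median so that the vector is centered on zero."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    return [v - median for v in values]

def correlation_distance(a, b):
    """1 - Pearson correlation: 0 means perfectly correlated, 1 uncorrelated."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical log-ratio measurements for two genes across four samples.
gene_1 = median_center([0.2, 1.4, 2.1, 3.0])
gene_2 = median_center([0.1, 0.9, 1.6, 2.2])
print(euclidean(gene_1, gene_2), manhattan(gene_1, gene_2),
      correlation_distance(gene_1, gene_2))
```

Correlation distance ignores the overall magnitude of a profile and compares only its shape, which is one reason it is often chosen for expression data.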

[0092] Various cluster linkage rules are useful for the methods of the invention. Single
linkage, a nearest neighbor method, determines the distance between the two closest objects. 
By contrast, complete linkage methods determine distance by the greatest distance between 
any two objects in the different clusters. This method is particularly useful in cases when 
genes or other cellular constituents form naturally distinct "clumps." Alternatively, the
unweighted pair-group average defines distance as the average distance between all pairs of
objects in two different clusters. This method is also very useful for clustering genes or other 
cellular constituents to form naturally distinct "clumps." Finally, the weighted pair-group 
average method may also be used. This method is the same as the unweighted pair-group 
average method except that the size of the respective clusters is used as a weight. This 
method is particularly useful for embodiments where the cluster size is suspected to be 
greatly varied (Sneath and Sokal, 1973, Numerical Taxonomy, San Francisco: W. H. Freeman
& Co.). Other cluster linkage rules, such as the unweighted and weighted pair-group centroid
and Ward's method, are also useful for some embodiments of the invention. See, e.g., Ward,
1963, J. Am. Stat. Assn. 58:236; Hartigan, 1975, Clustering Algorithms, New York: Wiley.
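
For illustration, the single, complete, and unweighted pair-group average linkage rules described above may be sketched as follows. The function names, the one-dimensional example values, and the absolute-difference distance are assumptions made for the sketch only.

```python
def single_linkage(cluster_a, cluster_b, dist):
    """Distance between the two closest objects of the two clusters."""
    return min(dist(a, b) for a in cluster_a for b in cluster_b)

def complete_linkage(cluster_a, cluster_b, dist):
    """Greatest distance between any two objects of the two clusters."""
    return max(dist(a, b) for a in cluster_a for b in cluster_b)

def average_linkage(cluster_a, cluster_b, dist):
    """Unweighted pair-group average: mean distance over all cross pairs."""
    total = sum(dist(a, b) for a in cluster_a for b in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def abs_diff(a, b):
    """Hypothetical one-dimensional distance used only for this example."""
    return abs(a - b)

cluster_a = [0.0, 1.0]
cluster_b = [4.0, 6.0]
print(single_linkage(cluster_a, cluster_b, abs_diff))    # closest pair: 1.0 and 4.0
print(complete_linkage(cluster_a, cluster_b, abs_diff))  # farthest pair: 0.0 and 6.0
print(average_linkage(cluster_a, cluster_b, abs_diff))   # mean of the four pair distances
```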

[0093] In one particularly preferred embodiment, the cluster analysis is performed with the
BRB-ArrayTools software, an integrated package for the visualization and statistical analysis of
cDNA microarray gene expression data developed by the Biometric Research Branch of the
National Cancer Institute, for both unsupervised and supervised analyses. The Class
Comparison Tool based on univariate F-tests may be used to find genes differentially
expressed between predefined clinical groups at a significance level of P<0.001 or 0.002.
The permutation distribution of the F-statistic, based on 2000 random permutations, may also
be used to confirm statistical significance. The multivariate Compound Covariate Predictor
(CCP) Tool with a "leave-one-out" cross-validation test using 2000 random permutations at a
significance level of P<0.001 may be used to classify predefined clinical groups based on their
gene expression profiles. In each cross-validation step, one sample is omitted and a
multivariate CCP is created based on the genes that are univariately significant at the
specified level in the training set consisting of the samples not omitted. This CCP is used to
classify the omitted sample, and it is then noted whether the classification is correct or
incorrect. This is repeated with all samples excluded one at a time. The total cross-validated
misclassification rate is thereby determined. The statistical significance of the cross-validated
misclassification rate is determined by repeating the entire cross-validation
procedure on data with the class membership labels randomly permuted 2000 times. The
CCP is based on a weighted linear combination of gene expression variables that are
univariately significant in the training set with the weights being the corresponding t-statistics
as described in Radmacher et al., Journal of Computational Biology, in press, 2002. An
example of a clustering "tree" output is shown in Figures 1 and 3 (see, also, Example 1, infra).
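
The leave-one-out procedure described above may be sketched, under stated assumptions, as follows. This is not the BRB-ArrayTools code: the function names and data are hypothetical, genes are selected by an absolute t-statistic cutoff as a stand-in for the P-value threshold, and the class boundary is assumed to be the midpoint between the mean compound covariates of the two training classes.

```python
import math
import random
from statistics import mean, variance

def t_statistic(x1, x2):
    """Two-sample t-statistic with pooled variance."""
    n1, n2 = len(x1), len(x2)
    pooled = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    return (mean(x1) - mean(x2)) / math.sqrt(pooled * (1 / n1 + 1 / n2))

def score(sample, weights):
    """Compound covariate: t-weighted sum over the selected genes."""
    return sum(t * sample[gene] for gene, t in weights.items())

def train_ccp(samples, labels, t_cutoff):
    """Select genes with |t| >= t_cutoff and set the midpoint threshold."""
    class_1 = [s for s, y in zip(samples, labels) if y == 1]
    class_0 = [s for s, y in zip(samples, labels) if y == 0]
    weights = {}
    for gene in range(len(samples[0])):
        t = t_statistic([s[gene] for s in class_1], [s[gene] for s in class_0])
        if abs(t) >= t_cutoff:
            weights[gene] = t
    threshold = (mean([score(s, weights) for s in class_1])
                 + mean([score(s, weights) for s in class_0])) / 2
    return weights, threshold

def loocv_error(samples, labels, t_cutoff=2.0):
    """Omit each sample in turn, retrain, and count misclassifications."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        weights, threshold = train_ccp(train_x, train_y, t_cutoff)
        predicted = 1 if score(samples[i], weights) > threshold else 0
        if predicted != labels[i]:
            errors += 1
    return errors / len(samples)

def permutation_p_value(samples, labels, n_perm=2000, seed=0):
    """Fraction of label permutations with an error rate no worse than observed."""
    rng = random.Random(seed)
    observed = loocv_error(samples, labels)
    hits = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        if loocv_error(samples, shuffled) <= observed:
            hits += 1
    return hits / n_perm

# Hypothetical data: 8 samples x 3 genes; only gene 0 separates the classes.
samples = [[5.0, 0.3, 1.1], [5.1, 0.9, 0.2], [5.2, 0.1, 0.7], [5.3, 0.6, 0.5],
           [1.0, 0.4, 0.9], [1.1, 0.8, 0.3], [1.2, 0.2, 0.6], [1.3, 0.7, 0.4]]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(loocv_error(samples, labels))
```

Gene selection is repeated inside every cross-validation step so that the omitted sample never influences which genes enter the predictor.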

[0094] Gene expression profiles may be defined based on the many smaller branches in the
tree, or a small number of larger branches by cutting across the tree at different levels. The
choice of cut level may be made to match the number of distinct clinical groups expected. If
little or no prior information is available about the number of groups, then the tree should be
divided into as many branches as are truly distinct. "Truly distinct" may be defined by a
minimum distance value between the individual branches. This distance is the vertical
coordinate of the horizontal connector joining two branches (see Figure 1B). Typical values
are in the range 0.2 to 0.4, where 0 is perfect correlation and 1 is zero correlation, but may be
larger for poorer quality data or fewer experiments in the training set, or smaller in the case of
better data and more experiments in the training set.
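
As a hedged sketch of cutting the tree at a minimum distance value, the following merges clusters under complete linkage until the closest pair of clusters is farther apart than the cutoff. The function names and profile values are hypothetical, and the naive search over all cluster pairs is chosen for brevity rather than efficiency.

```python
import math

def correlation_distance(a, b):
    """1 - Pearson correlation: 0 is perfect correlation, 1 is zero correlation."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def cut_clusters(items, dist, cutoff):
    """Agglomerate with complete linkage; stop once every merge would exceed cutoff.

    Returns clusters as lists of indices into `items`.
    """
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        return max(dist(items[a], items[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > cutoff:
            break  # remaining branches are "truly distinct" at this level
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical profiles: two rising and two falling expression patterns.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4.5, 2]]
print(cut_clusters(profiles, correlation_distance, cutoff=0.3))
```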

[0095] Preferably, "truly distinct" may be defined with an objective test of statistical
significance for each bifurcation in the tree. In one aspect of the invention, the Compound
Covariate Predictor (CCP) tool with a "leave-one-out" cross-validation test using 2000 random
permutations at a predefined significance level is used to define an objective test. The
distribution of fractional improvements obtained from the CCP procedure is an estimate of
the distribution under the null hypothesis that a particular classification is correct or incorrect.

[0096] Another aspect of the cluster analysis method of this invention provides the 
definition of basis vectors for use in profile projection described in the following sections. 

B. Profile Comparison and Classification 

[0097] One aspect of the invention provides methods for drug discovery. In one
embodiment, gene expression profiles are defined using cluster analysis. The genes within a
gene expression profile are indicated as potentially co-regulated under the conditions of
interest. Co-regulated genes are further explored as potentially being involved in a regulatory
pathway. Identification of genes involved in a regulatory pathway provides useful
information for designing and screening new drugs.

[0098] In some embodiments of the invention, drug candidates are screened for their 
therapeutic activity. In one embodiment, desired drug activity is to affect one particular 
genetic regulatory pathway. In this embodiment, drug candidates are screened for their 
ability to affect the gene expression profile corresponding to the regulatory pathway. In
another embodiment, a new drug is desired to replace an existing drug. In this embodiment,
the projected profiles of drug candidates are compared with that of the existing drug to 
determine which drug candidate has activities similar to the existing drug. 

[0099] In some embodiments, the method of the invention is used to decipher pathway 
arborization and kinetics. When a receptor is triggered (or blocked) by a ligand, the
excitation of the downstream pathways can be different depending on the exact temporal
profile and molecular domains of the ligand interaction with the receptor. Simple examples
of the differing effects of different ligands are the phenotypical differences that arise between
responses to agonists, partial agonists, negative antagonists, and antagonists, and that are
expected to occur in response to covalent vs. noncovalent binding and activation of different
molecular domains on the receptor. See Ross, Pharmacodynamics: Mechanisms of Drug
Action and the Relationship between Drug Concentration and Effect, in The Pharmacological
Basis of Therapeutics (Gilman et al. ed., McGraw Hill, New York, 1996). FIG. 4A illustrates
two different possible responses of a pathway cascade.

[0100] In some embodiments of the invention, receptors for ligands such as OPN may be
investigated using the projection method of the invention to simplify the observed temporal
responses to receptor/ligand interactions over the responding genes. In some particularly
preferred embodiments, the gene expression profiles and temporal profiles involved are
discovered. The profile of temporal responses of a large number of genes is projected onto
the predefined gene expression profiles to obtain a projected profile of temporal responses.
The projection process simplifies the observed responses so that different temporal responses
may be detected and discriminated more accurately.

C. Illustrative Diagnostic Applications 
[0101] One aspect of the invention provides methods for diagnosing diseases of humans, 
animals and plants. Those methods are also useful for monitoring the progression of diseases 
and the effectiveness of treatments. 

[0102] In one embodiment of the invention, a patient cell sample, such as a biopsy from a
patient's diseased tissue such as metastatic HCC, is assayed for the expression of a large
number of genes. The gene expression profile is projected into a profile of gene expression
profile expression values according to a definition of gene expression profiles. The projected
profile is then compared with a reference database containing reference projected profiles. If
the projected profile of the patient matches best with a cancer profile in the database, the
patient's diseased tissue is diagnosed as being cancerous. Similarly, when the best match is to
a profile of another disease or disorder, a diagnosis of such other disease or disorder is made.
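
One simple reading of the projection and matching steps above is sketched below, purely as an assumption: each co-regulated gene set is summarized by the average expression of its member genes, and the projected profile is assigned to the reference profile at the smallest Euclidean distance. The gene identifiers, gene set names, reference labels, and values are all hypothetical.

```python
import math

def project(expression, gene_sets):
    """Average the measurements of the genes in each predefined gene set.

    expression: dict mapping gene id -> measured value
    gene_sets:  dict mapping gene set name -> list of gene ids
    """
    return {name: sum(expression[g] for g in genes) / len(genes)
            for name, genes in gene_sets.items()}

def best_match(projected, reference_db):
    """Return the label of the reference projected profile closest to `projected`."""
    def distance(p, q):
        return math.sqrt(sum((p[k] - q[k]) ** 2 for k in p))
    return min(reference_db, key=lambda label: distance(projected, reference_db[label]))

# Hypothetical gene sets obtained from a prior cluster analysis.
gene_sets = {"set_A": ["g1", "g2"], "set_B": ["g3", "g4"]}

# Hypothetical patient measurements (e.g., log ratios).
patient = {"g1": 2.0, "g2": 4.0, "g3": -1.0, "g4": -3.0}

# Hypothetical reference database of projected profiles.
reference_db = {
    "metastatic": {"set_A": 2.5, "set_B": -1.5},
    "non-metastatic": {"set_A": 0.0, "set_B": 0.5},
}

projected = project(patient, gene_sets)
print(projected, best_match(projected, reference_db))
```

Working with a few gene set averages instead of thousands of individual genes reduces measurement noise in the comparison, which is consistent with the stated purpose of the projection.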

[0103] In another embodiment, a tissue sample is obtained from a patient's tumor. The
tissue sample is assayed for the expression of a large number of genes of interest. The gene
expression profile is projected into a profile of gene expression profile expression values
according to a definition of gene expression profiles. The projected profile is compared with
projected profiles previously obtained from the same tumor to identify the change of
expression in gene expression profiles. A reference library is used to determine whether the
gene expression profile changes indicate tumor progression such as metastasis. A similar
method is used to stage other diseases and disorders. Changes of gene expression profile
expression values in a profile obtained from a patient under treatment can be used to monitor
the effectiveness of the treatment, for example, by comparing the projected profile prior to
treatment with that after treatment.

D. Analytic Kit Implementation
[0104] In a preferred embodiment, the methods of this invention can be implemented by
use of kits for determining the responses or state of a biological sample. Such kits contain
microarrays, such as those described in subsections below. The microarrays contained in
such kits comprise a solid phase, e.g., a surface, to which probes are hybridized or bound at a
known location of the solid phase. Preferably, these probes consist of nucleic acids of
known, different sequence, with each nucleic acid being capable of hybridizing to an RNA
species or to a cDNA species derived therefrom. In particular, the probes contained in the
kits of this invention are nucleic acids capable of hybridizing specifically to nucleic acid
sequences derived from RNA species which are known to increase or decrease in response to
perturbations to the particular protein whose activity is determined by the kit. The probes
contained in the kits of this invention preferably substantially exclude nucleic acids which
hybridize to RNA species that are not increased in response to perturbations to the particular
protein whose activity is determined by the kit, such as osteopontin.

[0105] In a preferred embodiment, a kit of the invention also contains a database of gene 
expression profile definitions such as the databases described above or an access 
authorization to use the database described above from a remote networked computer. 

[0106] In another preferred embodiment, a kit of the invention further contains expression
profile projection and analysis software capable of being loaded into the memory of a
computer system such as the one described supra in the subsection, and illustrated in
Example 1. The expression profile analysis software contained in the kit of this invention is
essentially identical to the expression profile analysis software described above in Example 1.

[0107] Alternative kits for implementing the analytic methods of this invention will be
apparent to one of skill in the art and are intended to be comprehended within the
accompanying claims. In particular, the accompanying claims are intended to include the
alternative program structures for implementing the methods of this invention that will be
readily apparent to one of skill in the art.

E. Methods for Determining Biological Response Profiles
[0108] This invention utilizes the ability to measure the responses of a biological system to
a large variety of perturbations. This section provides some exemplary methods for
measuring biological responses. One of skill in the art would appreciate that this invention is
not limited to the following specific methods for measuring the responses of a biological
system.

1. Transcript Assay Using DNA Array 
[0109] This invention is particularly useful for the analysis of gene expression profiles. 
One aspect of the invention provides methods for defining co-regulated gene expression 
profiles based upon the correlation of gene expression. Some embodiments of this invention
are based on measuring the transcriptional rate of genes. 

[0110] The transcriptional rate can be measured by techniques of hybridization to arrays of 
nucleic acid or nucleic acid mimic probes, described in the next section, or by other gene 
expression technologies, such as those described in the subsequent subsection. However
measured, the result is either the absolute or relative amounts of transcripts, or response data
including values representing RNA abundance ratios, which usually reflect DNA expression
ratios (in the absence of differences in RNA degradation rates).

[0111] In various alternative embodiments of the present invention, aspects of the
biological state other than the transcriptional state, such as the translational state, the activity
state, or mixed aspects can be measured.

[0112] Preferably, measurement of the transcriptional state is made by hybridization to
DNA microarrays, which are described in this section. Certain other methods of 
transcriptional state measurement are described later in this subsection. 

[0113] In a preferred embodiment, the present invention makes use of DNA microarrays.
DNA microarrays can be employed for analyzing the transcriptional state in a biological
sample and especially for measuring the transcriptional states of a biological sample exposed
to graded levels of a drug of interest or to graded perturbations to a biological pathway of
interest.

[0114] In one embodiment, DNA microarrays are produced by hybridizing detectably
labeled polynucleotides representing the mRNA transcripts present in a cell (e.g.,
fluorescently labeled cDNA synthesized from total cell mRNA) to a microarray. A
microarray is a surface with an ordered array of binding (e.g., hybridization) sites for
products of many of the genes in the genome of a cell or organism, preferably most or almost
all of the genes. Microarrays can be made in a number of ways, of which several are
described below. However produced, microarrays share certain preferred characteristics: The
arrays are reproducible, allowing multiple copies of a given array to be produced and easily
compared with each other. Preferably the microarrays are small, usually smaller than 5 cm^2,
and they are made from materials that are stable under binding (e.g., nucleic acid
hybridization) conditions. A given binding site or unique set of binding sites in the
microarray will specifically bind the product of a single gene in the cell. Although there may
be more than one physical binding site (hereinafter "site") per specific mRNA, for the sake of
clarity the discussion below will assume that there is a single site.

[0115] It will be appreciated that when cDNA complementary to the RNA of a cell is made
and hybridized to a microarray under suitable hybridization conditions, the level of 
hybridization to the site in the array corresponding to any particular gene will reflect the 
prevalence in the cell of mRNA transcribed from that gene. For example, when detectably 
labeled (e.g., with a fluorophore) cDNA complementary to the total cellular mRNA is 
hybridized to a microarray, the site on the array corresponding to a gene (i.e., capable of 
specifically binding the product of the gene) that is not transcribed in the cell will have little 
or no signal (e.g., fluorescent signal), and a gene for which the encoded mRNA is prevalent 
will have a relatively strong signal. 

[0116] In preferred embodiments, cDNAs from two different cells are hybridized to the
binding sites of the microarray. In the case of drug responses, one biological sample is
exposed to a drug and another biological sample of the same type is not exposed to the drug.
In the case of pathway responses, one cell is exposed to a pathway perturbation and another
cell of the same type is not exposed to the pathway perturbation. The cDNAs derived from
the two cell types are differently labeled so that they can be distinguished. In one
embodiment, for example, cDNA from a cell treated with a drug (or exposed to a pathway 
perturbation) is synthesized using a fluorescein-labeled dNTP, and cDNA from a second cell, 
not drug-exposed, is synthesized using a rhodamine-labeled dNTP. When the two cDNAs 
are mixed and hybridized to the microarray, the relative intensity of signal from each cDNA 
set is determined for each site on the array, and any relative difference in abundance of a 
particular mRNA detected. 

[0117] In the example described above, the cDNA from the drug-treated (or pathway
perturbed) cell will fluoresce green when the fluorophore is stimulated and the cDNA from
the untreated cell will fluoresce red. As a result, when the drug treatment has no effect, either
directly or indirectly, on the relative abundance of a particular mRNA in a cell, the mRNA
will be equally prevalent in both cells and, upon reverse transcription, red-labeled and
green-labeled cDNA will be equally prevalent. When hybridized to the microarray, the binding
site(s) for that species of RNA will emit wavelengths characteristic of both fluorophores (and
appear brown in combination). In contrast, when the drug-exposed cell is treated with a drug
that, directly or indirectly, increases the prevalence of the mRNA in the cell, the ratio of
green to red fluorescence will increase. When the drug decreases the mRNA prevalence, the
ratio will decrease.
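
The green-to-red comparison described above is commonly expressed as a base-2 logarithm of the intensity ratio, so that equal increases and decreases are symmetric about zero. The following is a minimal sketch; the function names, the optional pseudocount, and the two-fold calling threshold are assumptions and are not taken from the specification.

```python
import math

def log2_ratio(green, red, pseudocount=0.0):
    """log2(green / red) for one array site.

    green: signal from the treated (fluorescein-labeled) sample
    red:   signal from the untreated (rhodamine-labeled) sample
    pseudocount: optional small constant to avoid division by zero (assumption)
    """
    return math.log2((green + pseudocount) / (red + pseudocount))

def call_change(green, red, fold=2.0):
    """Label a site as increased, decreased, or unchanged at a given fold cutoff."""
    ratio = log2_ratio(green, red)
    limit = math.log2(fold)
    if ratio >= limit:
        return "increased"
    if ratio <= -limit:
        return "decreased"
    return "unchanged"

# Hypothetical background-corrected intensities for three sites.
for green, red in [(400.0, 100.0), (100.0, 100.0), (50.0, 200.0)]:
    print(green, red, log2_ratio(green, red), call_change(green, red))
```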

[0118] The use of a two-color fluorescence labeling and detection scheme to define
alterations in gene expression has been described in, e.g., Schena et al., "Quantitative
monitoring of gene expression patterns with a complementary DNA microarray," Science
270:467-470, 1995, which is incorporated by reference in its entirety for all purposes. An
advantage of using cDNA labeled with two different fluorophores is that a direct and
internally controlled comparison of the mRNA levels corresponding to each arrayed gene in
two cell states can be made, and variations due to minor differences in experimental
conditions (e.g., hybridization conditions) will not affect subsequent analyses. However, it
will be recognized that it is also possible to use cDNA from a single cell, and compare, for
example, the absolute amount of a particular mRNA in, e.g., a drug-treated or
pathway-perturbed cell and an untreated cell.

2. Preparation of Microarrays 

[0119] Microarrays are known in the art and consist of a surface to which probes that
correspond in sequence to gene products (e.g., cDNAs, mRNAs, cRNAs, polypeptides, and
fragments thereof) can be specifically hybridized or bound at a known position. In one
embodiment, the microarray is an array (i.e., a matrix) in which each position represents a
discrete binding site for a product encoded by a gene (e.g., a protein or RNA), and in which
binding sites are present for products of most or almost all of the genes in the organism's
genome. In a preferred embodiment, the "binding site" (hereinafter, "site") is a nucleic acid
or nucleic acid analogue to which a particular cognate cDNA can specifically hybridize. The
nucleic acid or analogue of the binding site can be, e.g., a synthetic oligomer, a full-length
cDNA, a less-than full length cDNA, or a gene fragment.

[0120] Although in a preferred embodiment the microarray contains binding sites for
products of all or almost all genes in the target organism's genome, such comprehensiveness 
is not necessarily required. Usually the microarray will have binding sites corresponding to 
at least about 50% of the genes in the genome, often at least about 75%, more often at least 
about 85%, even more often more than about 90%, and most often at least about 99%. 
Preferably, the microarray has binding sites for genes relevant to the action of a drug of 
interest or in a biological pathway of interest. A "gene" is identified as an open reading 
frame (ORF) of preferably at least 50, 75, or 99 amino acids from which a messenger RNA is 
transcribed in the organism (e.g., if a single cell) or in some cell in a multicellular organism. 
The number of genes in a genome can be estimated from the number of mRNAs expressed by 
the organism, or by extrapolation from a well-characterized portion of the genome. When
the genome of the organism of interest has been sequenced, the number of ORFs can be
determined and mRNA coding regions identified by analysis of the DNA sequence. For
example, the Saccharomyces cerevisiae genome has been completely sequenced and is
reported to have approximately 6275 open reading frames (ORFs) longer than 99 amino
acids. Analysis of these ORFs indicates that there are 5885 ORFs that are likely to specify
protein products (Goffeau et al., 1996, Life with 6000 genes, Science 274:546-567, which is
incorporated by reference in its entirety for all purposes). In contrast, the human genome is
estimated to contain approximately 5 x 10^4 genes.

3. Preparing Nucleic Acids for Microarrays 
[0121] As noted above, the "binding site" to which a particular cognate cDNA specifically 
hybridizes is usually a nucleic acid or nucleic acid analogue attached at that binding site. In 
one embodiment, the binding sites of the microarray are DNA polynucleotides corresponding 
to at least a portion of each gene in an organism's genome. These DNAs can be obtained by, 
e.g., polymerase chain reaction (PCR) amplification of gene segments from genomic DNA, 
cDNA (e.g., by RT-PCR), or cloned sequences. PCR primers are chosen, based on the known 
sequence of the genes or cDNA, that result in amplification of unique fragments (i.e., 
fragments that do not share more than 10 bases of contiguous identical sequence with any 
other fragment on the microarray). Computer programs are useful in the design of primers 
with the required specificity and optimal amplification properties. See, e.g., Oligo version 
5.0 (National Biosciences). In the case of binding sites corresponding to very long genes, it
will sometimes be desirable to amplify segments near the 3' end of the gene so that when
oligo-dT primed cDNA probes are hybridized to the microarray, less-than-full length probes
will bind efficiently. Typically each gene fragment on the microarray will be between about
50 bp and about 2000 bp, more typically between about 100 bp and about 1000 bp, and
usually between about 300 bp and about 800 bp in length. PCR methods are well known and
are described, for example, in Innis et al. eds., 1990, PCR Protocols: A Guide to Methods and
Applications, Academic Press Inc., San Diego, Calif., which is incorporated by reference in
its entirety for all purposes. It will be apparent that computer controlled robotic systems are
useful for isolating and amplifying nucleic acids.
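
The uniqueness criterion stated above (no more than 10 bases of contiguous identical sequence shared between fragments) may be checked, for illustration, by looking for any shared run of 11 bases. This sketch is not the Oligo program; the function names and sequences are hypothetical, and only the given strand is compared (reverse complements are not considered).

```python
def shares_long_run(seq_a, seq_b, max_shared=10):
    """True if the two sequences share more than `max_shared` contiguous bases."""
    k = max_shared + 1
    kmers_a = {seq_a[i:i + k] for i in range(len(seq_a) - k + 1)}
    return any(seq_b[i:i + k] in kmers_a for i in range(len(seq_b) - k + 1))

def all_fragments_unique(fragments, max_shared=10):
    """True if no pair of fragments shares a run longer than `max_shared` bases."""
    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            if shares_long_run(fragments[i], fragments[j], max_shared):
                return False
    return True

# Hypothetical fragments for the example only.
frag_1 = "ACGTACGTACGGTTCA"
frag_2 = "TTTACGTACGTACGAAA"   # contains an 11-base run found in frag_1
frag_3 = "GGGACGTACGTACCCC"    # shares only a 10-base run with frag_1

print(shares_long_run(frag_1, frag_2))
print(shares_long_run(frag_1, frag_3))
print(all_fragments_unique([frag_1, frag_3]))
```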

[0122] An alternative means for generating the nucleic acid for the microarray is by
synthesis of synthetic polynucleotides or oligonucleotides, e.g., using H-phosphonate or
phosphoramidite chemistries (Froehler et al., 1986, Nucleic Acids Res. 14:5399-5407;
McBride et al., 1983, Tetrahedron Lett. 24:245-248). Synthetic sequences are between about
15 and about 500 bases in length, more typically between about 20 and about 50 bases. In
some embodiments, synthetic nucleic acids include non-natural bases, e.g., inosine. As noted
above, nucleic acid analogues may be used as binding sites for hybridization. An example of
a suitable nucleic acid analogue is peptide nucleic acid (see, e.g., Egholm et al., 1993, PNA
hybridizes to complementary oligonucleotides obeying the Watson-Crick hydrogen-bonding
rules, Nature 365:566-568; see also U.S. Pat. No. 5,539,083).

[0123] In an alternative embodiment, the binding (hybridization) sites are made from 
plasmid or phage clones of genes, cDNAs (e.g., expressed sequence tags), or inserts 
therefrom (Nguyen et al., 1995, Differential gene expression in the murine thymus assayed by 
quantitative hybridization of arrayed cDNA clones, Genomics 29:207-209). In yet another 
embodiment, the polynucleotide of the binding sites is RNA.

4. Attaching Nucleic Acids to the Solid Surface 
[0124] The nucleic acid or analogue is attached to a solid support, which may be made
from glass, plastic (e.g., polypropylene, nylon), polyacrylamide, nitrocellulose, or other
materials. A preferred method for attaching the nucleic acids to a surface is by printing on
glass plates, as is described generally by Schena et al., 1995, Quantitative monitoring of gene
expression patterns with a complementary DNA microarray, Science 270:467-470. This
method is especially useful for preparing microarrays of cDNA. See also DeRisi et al., 1996,
Use of a cDNA microarray to analyze gene expression patterns in human cancer, Nature
Genetics 14:457-460; Shalon et al., 1996, A DNA microarray system for analyzing complex
DNA samples using two-color fluorescent probe hybridization, Genome Res. 6:639-645; and
Schena et al., 1995, Parallel human genome analysis; microarray-based expression of 1000
genes, Proc. Natl. Acad. Sci. USA 93:10539-11286.

[0125] A second preferred method for making microarrays is by making high-density
oligonucleotide arrays. Techniques are known for producing arrays containing thousands of 
oligonucleotides complementary to defined sequences, at defined locations on a surface using 
photolithographic techniques for synthesis in situ (see, Fodor et al., 1991, Light-directed 
spatially addressable parallel chemical synthesis, Science 251:767-773; Pease et al., 1994, 
Light-directed oligonucleotide arrays for rapid DNA sequence analysis, Proc. Natl. Acad. Sci. 
USA 91:5022-5026; Lockhart et al., 1996, Expression monitoring by hybridization to
high-density oligonucleotide arrays, Nature Biotech 14:1675; U.S. Pat. Nos. 5,578,832; 5,556,752;
and 5,510,270, each of which is incorporated by reference in its entirety for all purposes) or
other methods for rapid synthesis and deposition of defined oligonucleotides (Blanchard et
al., 1996, High-Density Oligonucleotide arrays, Biosensors & Bioelectronics 11:687-90).
When these methods are used, oligonucleotides (e.g., 20-mers) of known sequence are
synthesized directly on a surface such as a derivatized glass slide. Usually, the array
produced contains multiple probes against each target transcript. Oligonucleotide probes can
be chosen to detect alternatively spliced mRNAs or to serve as various types of control.

[0126] Another preferred method of making microarrays is by use of an inkjet printing
process to synthesize oligonucleotides directly on a solid phase. 

[0127] Other methods for making microarrays, e.g., by masking (Maskos and Southern,
1992, Nuc. Acids Res. 20:1679-1684), may also be used. In principle, any type of array, for
example, dot blots on a nylon hybridization membrane (see Sambrook and Russell, 
Molecular Cloning: A Laboratory Manual 3d ed, Cold Spring Harbor Laboratory, Cold 
Spring Harbor, N.Y., 2001), could be used, although, as will be recognized by those of skill 
in the art, very small arrays will be preferred because hybridization volumes will be smaller. 

5. Generating Labeled Probes 
[0128] Methods for preparing total and poly(A)+ RNA are well known and are described 
generally in Sambrook et al., supra. In one embodiment, RNA is extracted from biological 
samples of the various types of interest in this invention using guanidinium thiocyanate lysis
followed by CsCl centrifugation (Chirgwin et al., 1979, Biochemistry 18:5294-5299).
Alternatively, total RNA may be extracted from samples using TRIzol reagent (Life
Technologies) according to the manufacturer's directions. Poly(A)+ RNA is selected
with oligo-dT cellulose (see Sambrook and Russell, supra). Biological samples of
interest include normal liver samples, non-cancerous liver samples and samples from defined
clinical specimens.

[0129] Labeled cDNA is prepared from mRNA by oligo dT-primed or random-primed
reverse transcription, both of which are well known in the art (see, e.g., Klug and Berger,
1987, Methods Enzymol. 152:316-325). Reverse transcription may be carried out in the
presence of a dNTP conjugated to a detectable label, most preferably a fluorescently labeled
dNTP. Alternatively, isolated mRNA can be converted to labeled antisense RNA synthesized
by in vitro transcription of double-stranded cDNA in the presence of labeled dNTPs
(Lockhart et al., 1996, Expression monitoring by hybridization to high-density
oligonucleotide arrays, Nature Biotech. 14:1675, which is incorporated by reference in its
entirety for all purposes). In alternative embodiments, the cDNA or RNA probe can be
synthesized in the absence of detectable label and may be labeled subsequently, e.g., by
incorporating biotinylated dNTPs or rNTPs, or some similar means (e.g., photo-cross-linking a
psoralen derivative of biotin to RNAs), followed by addition of labeled streptavidin (e.g.,
phycoerythrin-conjugated streptavidin) or the equivalent.

[0130] When fluorescently-labeled probes are used, many suitable fluorophores are known,
including fluorescein, lissamine, phycoerythrin, rhodamine (Perkin Elmer Cetus), Cy2, Cy3,
Cy3.5, Cy5, Cy5.5, Cy7, FluorX (Amersham) and others (see, e.g., Kricka, 1992,
Nonisotopic DNA Probe Techniques, Academic Press, San Diego, Calif.). It will be
appreciated that pairs of fluorophores are chosen that have distinct emission spectra so that
they can be easily distinguished.

[0131] In another embodiment, a label other than a fluorescent label is used. For example,
a radioactive label, or a pair of radioactive labels with distinct emission spectra, can be used
(see Zhao et al., 1995, High density cDNA filter analysis: a novel approach for large-scale,
quantitative analysis of gene expression, Gene 156:207; Pietu et al., 1996, Novel gene
transcripts preferentially expressed in human muscles revealed by quantitative hybridization
of a high density cDNA array, Genome Res. 6:492). However, because of scattering of
radioactive particles, and the consequent requirement for widely spaced binding sites, use of
radioisotopes is a less-preferred embodiment.

[0132] In one embodiment, labeled cDNA is synthesized by incubating a mixture 
containing 0.5 mM dGTP, dATP and dCTP plus 0.1 mM dTTP plus fluorescent 
deoxyribonucleotides (e.g., 0.1 mM Rhodamine 110 UTP (Perkin Elmer Cetus) or 0.1 mM
Cy3 dUTP (Amersham)) with reverse transcriptase (e.g., SuperScript™ II, LTI Inc.) at
42°C for 60 minutes. 

6. Hybridization to Microarrays 
[0133] Nucleic acid hybridization and wash conditions are optimally chosen so that the
probe "specifically binds" or "specifically hybridizes" to a specific array site, i.e., the probe
hybridizes, duplexes or binds to a sequence array site with a complementary nucleic acid
sequence but does not hybridize to a site with a non-complementary nucleic acid sequence.
As used herein, one polynucleotide sequence is considered complementary to another when,
if the shorter of the polynucleotides is less than or equal to 25 bases, there are no mismatches
using standard base-pairing rules or, if the shorter of the polynucleotides is longer than 25
bases, there is no more than a 5% mismatch. Preferably, the polynucleotides are perfectly
complementary (no mismatches). It can easily be demonstrated that specific hybridization
conditions result in specific hybridization by carrying out a hybridization assay including
negative controls (see, e.g., Shalon et al., supra, and Chee et al., supra).
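
The complementarity definition above can be illustrated with a short Python sketch. It is only an illustration under simplifying assumptions: both strands are DNA, written 5' to 3', already trimmed to the same length (i.e., the shorter polynucleotide and the region of the longer one it pairs with), and aligned without gaps. The function names are hypothetical and are not part of the disclosed method.

```python
# Hypothetical helpers illustrating the complementarity rule stated in the text.
_PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the Watson-Crick reverse complement of a DNA sequence."""
    return "".join(_PAIR[base] for base in reversed(seq.upper()))


def count_mismatches(probe: str, target: str) -> int:
    """Count positions where the probe differs from the target's perfect partner."""
    expected = reverse_complement(target)
    return sum(1 for p, e in zip(probe.upper(), expected) if p != e)


def is_complementary(probe: str, target: str) -> bool:
    """Apply the rule: no mismatches up to 25 bases, at most 5% above 25 bases."""
    if len(probe) != len(target):
        raise ValueError("sketch assumes equal-length, ungapped strands")
    n = len(probe)
    mismatches = count_mismatches(probe, target)
    if n <= 25:
        return mismatches == 0
    # Integer arithmetic avoids floating-point issues exactly at the 5% boundary.
    return mismatches * 100 <= 5 * n
```

For a 40-base pair of strands, for example, two mismatches (5%) still satisfy the definition, whereas three (7.5%) do not.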

[0134] Optimal hybridization conditions will depend on the length (e.g., oligomer versus
polynucleotide greater than 200 bases) and type (e.g., RNA, DNA, PNA) of labeled probe
and immobilized polynucleotide or oligonucleotide. General parameters for specific (i.e.,
stringent) hybridization conditions for nucleic acids are described in Sambrook et al., supra,
and in Ausubel et al., 1987, Current Protocols in Molecular Biology, Greene Publishing and
Wiley-Interscience, New York. When the cDNA microarrays of Schena et al. are used,
typical hybridization conditions are hybridization in 5xSSC plus 0.2% SDS at 65°C for 4
hours followed by washes at 25°C in low stringency wash buffer (1xSSC plus 0.2% SDS)
followed by 10 minutes at 25°C in high stringency wash buffer (0.1xSSC plus 0.2% SDS)
(Schena et al., 1996, Proc. Natl. Acad. Sci. USA 93:10614). Useful hybridization conditions
are also provided in, e.g., Tijssen, 1993, Hybridization With Nucleic Acid Probes, Elsevier
Science Publishers B.V. and Kricka, 1992, Nonisotopic DNA Probe Techniques, Academic
Press, San Diego, Calif.

7. Signal Detection and Data Analysis
[0135] When fluorescently labeled probes are used, the fluorescence emissions at each site
of a transcript array can be detected by scanning confocal laser microscopy. Preferably the
fluorescent intensities are measured by the Axon GenePix 4000 scanner. In one embodiment,
a separate scan, using the appropriate excitation line, is carried out for each of the two
fluorophores used. Alternatively, a laser can be used that allows simultaneous specimen
illumination at wavelengths specific to the two fluorophores and emissions from the two
fluorophores can be analyzed simultaneously (see Shalon et al., 1996, A DNA microarray
system for analyzing complex DNA samples using two-color fluorescent probe hybridization,
Genome Research 6:639-645, which is incorporated by reference in its entirety for all
purposes). In a preferred embodiment, the arrays are scanned with a laser fluorescent scanner
with a computer controlled X-Y stage and a microscope objective. Sequential excitation of
the two fluorophores is achieved with a multi-line, mixed gas laser and the emitted light is
split by wavelength and detected with two photomultiplier tubes. Fluorescence laser
scanning devices are described in Shalon et al., 1996, Genome Res. 6:639-645 and in other
references cited herein. Alternatively, the fiber-optic bundle described by Ferguson et al.,
1996, Nature Biotech. 14:1681-1684, may be used to monitor mRNA abundance levels at a
large number of sites simultaneously.

[0136] Signals are recorded and, in a preferred embodiment, analyzed by computer, e.g.,
using a 12 bit analog to digital board. In one embodiment the scanned image is despeckled
using a graphics program (e.g., Hijaak Graphics Suite) and then analyzed using an image
gridding program that creates a spreadsheet of the average hybridization at each wavelength
at each site. If necessary, an experimentally determined correction for "cross talk" (or
overlap) between the channels for the two fluors may be made. In a preferred embodiment,
the fluorescent intensities were analyzed by the GenePix Pro 3.0 software to subtract the
background signals. The expression data were then filtered based on their channel intensities,
spot size and flag (missing data), and the Cy5/Cy3 ratios were calculated and normalized by
median-centering the log-ratio of all genes in each array. For any particular hybridization site
on the transcript array, a ratio of the emission of the two fluorophores can be calculated. The
ratio is independent of the absolute expression level of the cognate gene, but is useful for
genes whose expression is significantly modulated by drug administration, gene deletion, or
any other tested event.
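
The background subtraction, filtering, and median-centering steps described in this paragraph can be sketched in Python as follows. This is a minimal illustration, not the GenePix Pro procedure itself: the record layout (keys `gene`, `cy5`, `cy5_bg`, `cy3`, `cy3_bg`, `flagged`) and the `min_intensity` cutoff are assumptions made for the example, and spot-size filtering is omitted for brevity.

```python
import math
from statistics import median


def normalized_log_ratios(spots, min_intensity=100.0):
    """Return {gene: median-centered log2(Cy5/Cy3)} for one array.

    `spots` is a list of dicts (hypothetical layout) holding raw and
    background intensities per channel plus a `flagged` marker.
    """
    raw = {}
    for spot in spots:
        if spot["flagged"]:  # flagged (missing data) spots are dropped
            continue
        cy5 = spot["cy5"] - spot["cy5_bg"]  # background-subtracted signals
        cy3 = spot["cy3"] - spot["cy3_bg"]
        if cy5 < min_intensity or cy3 < min_intensity:
            continue  # channel-intensity filter (assumed threshold)
        raw[spot["gene"]] = math.log2(cy5 / cy3)
    if not raw:
        return {}
    center = median(raw.values())  # median-center the log-ratios of the array
    return {gene: value - center for gene, value in raw.items()}
```

Median-centering on the log scale shifts every gene on an array by the same amount, so the differences between genes are preserved while array-wide dye or loading effects are removed.
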

[0137] According to the method of the invention, the relative abundance of an mRNA in
two biological samples is scored as a perturbation and its magnitude determined (i.e., the
abundance is different in the two sources of mRNA tested), or as not perturbed (i.e., the
relative abundance is the same). In various embodiments, a difference between the two
sources of RNA of at least a factor of about 25% (RNA is 25% more abundant in one
source than in the other source), more usually about 50%, even more often by a
factor of about 2 (twice as abundant), 3 (three times as abundant) or 5 (five times as
abundant) is scored as a perturbation.
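
As an illustration of this scoring rule, the following Python sketch classifies a pair of abundance measurements against a chosen fold threshold. The function name, the return format, and the default threshold of 2 are assumptions for the example; the text contemplates thresholds of about 1.25, 1.5, 2, 3 or 5.

```python
def score_perturbation(abundance_a, abundance_b, min_fold=2.0):
    """Score the relative abundance of one mRNA in two samples.

    Returns (call, fold), where call is "up", "down" or "unperturbed"
    (sample A relative to sample B) and fold is the symmetric fold
    difference (always >= 1).
    """
    if abundance_a <= 0 or abundance_b <= 0:
        raise ValueError("abundances must be positive")
    ratio = abundance_a / abundance_b
    fold = max(ratio, 1.0 / ratio)  # magnitude, independent of direction
    if fold < min_fold:
        return ("unperturbed", fold)
    return ("up" if ratio > 1.0 else "down", fold)
```

Calling `score_perturbation(125.0, 100.0, min_fold=1.25)` would, under these assumptions, report an upward perturbation, corresponding to the "25% more abundant" case.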

[0138] Preferably, in addition to identifying a perturbation as positive or negative, it is 
advantageous to determine the magnitude of the perturbation. This can be carried out, as
noted above, by calculating the ratio of the emission of the two fluorophores used for 
differential labeling, or by analogous methods that will be readily apparent to those of skill in 
the art. 

8. Pathway Response and Gene Expression Profiles
[0139] In one embodiment of the present invention, gene expression profiles are
determined by observing the gene expression profile of a clinical sample of interest. In one
embodiment of the invention, DNA microarrays reflecting the transcriptional state of a
biological sample of interest are made by hybridizing a mixture of two differently labeled
probes each corresponding (i.e., complementary) to the mRNA of a clinical sample of interest
or a reference sample, to the microarray. According to the present invention, the two samples
are of the same type, i.e., of the same species and tissue type, but may differ in clinical
diagnosis. The genes whose expression is highly correlated may belong to a gene
expression profile.

[0140] Further, to reduce experimental error, it is preferable to reverse the
fluorescent labels in two-color differential hybridization experiments, which reduces biases
peculiar to individual genes or array spot locations. In other words, it is preferable to first
measure gene expression with one labeling (e.g., labeling perturbed cells with a first
fluorochrome and unperturbed cells with a second fluorochrome) of the mRNA from the two
cells being measured, and then to measure gene expression from the two cells with reversed
labeling (e.g., labeling perturbed cells with the second fluorochrome and unperturbed cells
with the first fluorochrome). Multiple measurements over exposure levels and perturbation
control parameter levels provide additional experimental error control. With adequate
sampling a trade-off may be made when choosing the width of the spline function S used to
interpolate response data between averaging of errors and loss of structure in the response
functions.
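
One common way to combine the two hybridizations of such a label-reversal (dye-swap) pair is sketched below in Python. It is offered only as an illustration and is not stated in the text: it assumes each hybridization has already been reduced to per-gene log2(Cy5/Cy3) values, that the sample of interest carries Cy5 in the forward hybridization and Cy3 in the reversed one, and that any gene-specific dye bias adds the same amount to both log-ratios. The function name is hypothetical.

```python
def dye_swap_average(forward, reverse):
    """Combine a dye-swap pair of log2(Cy5/Cy3) measurements.

    forward: {gene: log-ratio} with the sample of interest labeled Cy5.
    reverse: {gene: log-ratio} with the sample of interest labeled Cy3.
    Under the additive-bias assumption, forward = signal + bias and
    reverse = -signal + bias, so half the difference recovers the signal.
    """
    shared = forward.keys() & reverse.keys()  # genes measured in both arrays
    return {gene: (forward[gene] - reverse[gene]) / 2.0 for gene in shared}
```

Genes measured in only one of the two hybridizations are left out of the result, since their bias cannot be cancelled.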

9. Other Methods of Transcriptional State Measurement 
[0141] The transcriptional state of a cell may be measured by other gene expression
technologies known in the art. Several such technologies produce pools of restriction
fragments of limited complexity for electrophoretic analysis, such as methods combining
double restriction enzyme digestion with phasing primers (see, e.g., European Patent 0
534858 A1, filed Sep. 24, 1992, by Zabeau et al.), or methods selecting restriction fragments
with sites closest to a defined mRNA end (see, e.g., Prashar et al., 1996, Proc. Natl. Acad.
Sci. USA 93:659-663). Other methods statistically sample cDNA pools, such as by
sequencing sufficient bases (e.g., 20-50 bases) in each of multiple cDNAs to identify each
cDNA, or by sequencing short tags (e.g., 9-10 bases) which are generated at known positions
relative to a defined mRNA end (see, e.g., Velculescu et al., 1995, Science 270:484-487).

10. Measurement of Other Aspects of Biological State

[0142] In various embodiments of the present invention, aspects of the biological state 
other than the transcriptional state, such as the translational state, the activity state, or mixed 
aspects can be measured in order to obtain drug and pathway responses. Details of these 
embodiments are described infra. 

11. Embodiments Based on Translational State Measurements

[0143] Measurement of the translational state may be performed according to several 
methods. For example, whole genome monitoring of protein (i.e., the "proteome," Goffeau et 
al., supra) can be carried out by constructing a microarray in which binding sites comprise 
immobilized, preferably monoclonal, antibodies specific to a plurality of protein species
encoded by the cell genome. Preferably, antibodies are present for a substantial fraction of
the encoded proteins, or at least for those proteins relevant to the action of a drug of interest.
Methods for making monoclonal antibodies are well known (see, e.g., Harlow and Lane,
1988, Antibodies: A Laboratory Manual, Cold Spring Harbor, N.Y., which is incorporated in
its entirety for all purposes). In a preferred embodiment, monoclonal antibodies are raised
against synthetic peptide fragments designed based on genomic sequence of the cell. With
such an antibody array, proteins from the cell are contacted to the array and their binding is 
assayed with assays known in the art.

[0144] Alternatively, proteins can be separated by two-dimensional gel electrophoresis
systems. Two-dimensional gel electrophoresis is well-known in the art and typically involves
iso-electric focusing along a first dimension followed by SDS-PAGE along a
second dimension. See, e.g., Hames et al., 1990, Gel Electrophoresis of Proteins: A Practical
Approach, IRL Press, New York; Shevchenko et al., 1996, Proc. Natl. Acad. Sci. USA
93:1440-1445; Sagliocco et al., 1996, Yeast 12:1519-1533; Lander, 1996, Science 274:536-539.
The resulting electropherograms can be analyzed by numerous techniques, including
mass spectrometric techniques, western blotting and immunoblot analysis using polyclonal 
and monoclonal antibodies, and internal and N-terminal micro-sequencing. Using these 
techniques, it is possible to identify a substantial fraction of all the proteins produced under 
given physiological conditions, including in cells (e.g., in yeast) exposed to a drug, or in cells 
modified by, e.g., deletion or over-expression of a specific gene. 

12. Embodiments Based on Other Aspects of the Biological State 

[0145] Even though methods of this invention are illustrated by embodiments involving
gene expression profiles, the methods of the invention are applicable to any cellular 
constituent that can be monitored. 

[0146] In particular, where activities of proteins relevant to the characterization of a 
perturbation, such as drug action, can be measured, embodiments of this invention can be 
based on such measurements. Activity measurements can be performed by any functional, 
biochemical, or physical means appropriate to the particular activity being characterized. 
Where the activity involves a chemical transformation, the cellular protein can be contacted 
with the natural substrate(s), and the rate of transformation measured. Where the activity 
involves association in multimeric units, for example association of an activated DNA 
binding complex with DNA, the amount of associated protein or secondary consequences of 
the association, such as amounts of mRNA transcribed, can be measured. Also, where only a 
functional activity is known, for example, as in cell cycle control, performance of the 
function can be observed. However known and measured, the changes in protein activities 
form the response data analyzed by the foregoing methods of this invention.

[0147] In alternative and non-limiting embodiments, response data may be formed of
mixed aspects of the biological state of a cell. Response data can be constructed from, e.g., 
changes in certain mRNA abundances, changes in certain protein abundances, and changes in 
certain protein activities.

II. Proteomic Analysis

[0148] In another aspect, the invention provides methods for detecting markers which are 
differentially present in the samples of a metastatic HCC tumor or tissue samples of patients 
predisposed for HCC (e.g., patients at high risk for developing HCC but where the tumor is 
undetectable). The markers can be detected in a number of biological samples. The sample
is preferably a biological tissue sample lysate. 

[0149] Any suitable methods can be used to detect one or more of the markers described 
herein. For example, gas phase ion spectrometry can be used. This technique includes, e.g., 
laser desorption/ionization mass spectrometry. Preferably, the sample is prepared prior to gas
phase ion spectrometry, e.g., by pre-fractionation, two-dimensional gel chromatography, high
performance liquid chromatography, etc., to assist detection of markers. Detection of markers
can be achieved using methods other than gas phase ion spectrometry. For example, 
immunoassays can be used to detect the markers in a sample. These detection methods are 
described in detail below. 

A. Detection by Gas Phase Ion Spectrometry

[0150] Markers present in a biological sample can be detected using gas phase ion 
spectrometry, and preferably, mass spectrometry. In one embodiment, matrix-assisted laser 
desorption/ionization ("MALDI") mass spectrometry can be used. In another embodiment, 
surface-enhanced laser desorption/ionization mass spectrometry ("SELDI") can be used. 

1. Preparation of a Sample Prior to Gas Phase Ion Spectrometry

[0151] One or a combination of standard techniques well known in the art can be used to
prepare a sample to further assist detection and characterization of markers in a sample. For
example, a sample can be pre-fractionated to provide a less complex biological sample prior
to gas phase ion spectrometry analysis using one or more of the following methods: size
exclusion chromatography, anion exchange chromatography, affinity chromatography,
sequential extraction, gel electrophoresis, and high performance liquid chromatography
(HPLC).

[0152] Optionally, a marker can be modified before analysis to improve its resolution or to 
determine its identity. For example, the markers may be subject to proteolytic digestion 
before analysis. Fragments from a digestion by a suitable protease, such as trypsin, may
function as a fingerprint for the markers, thereby enabling their detection indirectly.

2. Contacting a Sample with a Substrate for Gas Phase Ion
Spectrometry Analysis
[0153] A biological sample can be contacted with a substrate, such as a spectrometer probe 
adapted for use with a gas phase ion spectrometer. Alternatively, a substrate can be a 
separate material that can be placed onto a spectrometer probe that is adapted for use with a
gas phase ion spectrometer. 

[0154] A spectrometer probe can be in any suitable shape as long as it is adapted for use
with a gas phase ion spectrometer (e.g., removably insertable into a gas phase ion
spectrometer). The spectrometer probe substrate can be made of any suitable material, solid
or porous. Spectrometer probes suitable for use in embodiments of the invention are
described in, e.g., U.S. Patent No. 5,617,060 (Hutchens and Yip) and WO 98/59360 
(Hutchens and Yip). 

[0155] If complexity of a sample has been substantially reduced as described above, the 
sample can be contacted with any suitable substrate for gas phase ion spectrometry. Prior to 
gas phase ion spectrometry analysis, an energy absorbing molecule ("EAM") or a matrix
material is typically applied to markers on the substrate surface. The energy absorbing 
molecule and the sample containing markers can be contacted in any suitable manner. 

[0156] Complexity of a sample can be further reduced using a substrate that comprises 
adsorbents capable of binding one or more markers. Adsorbents that bind the markers can be 
applied to the substrate in any suitable pattern (e.g., continuous or discontinuous), and a
sample can be contacted with a substrate comprising an adsorbent in any suitable manner, 
e.g., bathing, soaking, dipping, spraying, washing over, or pipetting, etc. Following the 
contact, it is preferred that unbound materials on the substrate surface are washed out so that 
only the bound materials remain on the substrate surface. 

3. Desorption/Ionization and Detection

[0157] Markers on the substrate surface can be desorbed and ionized using gas phase ion 
spectrometry. Any suitable gas phase ion spectrometer can be used as long as it allows
markers on the substrate to be resolved. Preferably, the gas phase ion spectrometer allows
quantitation of markers. In one embodiment, the gas phase ion spectrometer is a mass
spectrometer, preferably a laser desorption time-of-flight mass spectrometer. In another
embodiment, an ion mobility spectrometer can be used to detect markers. In yet another
embodiment, a total ion current measuring device can be used to detect and characterize
markers. 

4. Analysis of Data 
[0158] Data generated by desorption and detection of markers can be analyzed using any 
suitable means. In one embodiment, data sets are analyzed with the use of a programmable
digital computer. The computer program generally contains a readable medium that stores
codes. Certain code can be devoted to memory that includes the location of each feature on a
spectrometer probe, the identity of the adsorbent at that feature and the elution conditions
used to wash the adsorbent. The computer also contains code that receives as input, data on
the strength of the signal at various molecular masses received from a particular addressable
location on the spectrometer probe. These data can indicate the number of markers detected, 
including the strength of the signal generated by each marker. 

[0159] Data analysis can include the steps of determining signal strength (e.g., height of
peaks) of a marker detected and removing "outliers" (data deviating from a predetermined
statistical distribution). The observed peaks can be normalized, a process whereby the height
of each peak relative to some reference is calculated. For example, a reference can be
background noise generated by instrument and chemicals (e.g., energy absorbing molecule)
which is set as zero in the scale. Then the signal strength detected for each marker or other
biomolecules can be displayed in the form of relative intensities in the scale desired (e.g.,
100). Alternatively, a standard (e.g., a serum protein) may be admitted with the sample so
that a peak from the standard can be used as a reference to calculate relative intensities of the
signals observed for each marker or other markers detected.
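
The normalization described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the behavior of any particular instrument software: peaks are given as a mapping from mass to raw height, the noise level is supplied as a single number, and, when no standard peak is supplied, the tallest peak is used as the top of the scale. All names are hypothetical.

```python
def relative_intensities(peaks, noise_baseline, reference_height=None, scale=100.0):
    """Express peak heights as relative intensities on a chosen scale.

    peaks: {mass: raw peak height}.
    noise_baseline: background noise level, set as zero in the scale.
    reference_height: raw height of a standard's peak, if one was added
        to the sample; otherwise the tallest corrected peak is the reference.
    """
    if not peaks:
        return {}
    # Subtract the noise floor; heights below it are clipped to zero.
    corrected = {mass: max(height - noise_baseline, 0.0) for mass, height in peaks.items()}
    if reference_height is not None:
        reference = reference_height - noise_baseline
    else:
        reference = max(corrected.values())
    if reference <= 0:
        raise ValueError("reference peak does not rise above the noise baseline")
    return {mass: scale * height / reference for mass, height in corrected.items()}
```

Using a standard's peak as the reference makes intensities comparable across spectra, whereas scaling to the tallest peak only describes the relative composition within one spectrum.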

[0160] The computer can transform the resulting data into various formats for displaying.
In one format, referred to as "spectrum view" or "retentate map," a standard spectral view can
be displayed, wherein the view depicts the quantity of marker reaching the detector at each
particular molecular weight. In another format, referred to as "peak map," only the peak
height and mass information are retained from the spectrum view, yielding a cleaner image
and enabling markers with nearly identical molecular weights to be more easily seen. In yet
another format, referred to as "gel view," each mass from the peak view can be converted
into a grayscale image based on the height of each peak, resulting in an appearance similar to
bands on electrophoretic gels. In yet another format, referred to as "3-D overlays," several
spectra can be overlaid to study subtle changes in relative peak heights. In yet another
format, referred to as "difference map view," two or more spectra can be compared,
conveniently highlighting unique markers and markers which are up- or down-regulated
between samples. Marker profiles (spectra) from any two samples may be compared
visually. In yet another format, Spotfire Scatter Plot can be used, wherein markers that are
detected are plotted as a dot in a plot, wherein one axis of the plot represents the apparent
molecular weight of the markers detected and another axis represents the signal intensity of
markers detected. For each biological sample, markers that are detected and the amount of
markers present in the biological sample can be saved in a computer readable medium. These
data can then be compared to a control (e.g., a profile or quantity of markers detected in
control, e.g., patients in whom metastatic HCC or a predisposition for HCC is
undetectable).

[0161] A method for predicting the potential of developing metastasis in an HCC patient or
developing HCC in a patient with chronic liver disease can be embodied by code that is
executed by a digital computer capable of processing data sets derived from signals from
arrays after contact with patient samples. The code can be executed by the digital computer
to create an analytical model. The code may be stored on any suitable computer readable
media. Examples of computer readable media include magnetic, electronic, or optical disks,
tapes, sticks, chips, etc. The code may also be written in any suitable computer programming
language including Visual Basic, Fortran, C, C++, etc. The digital computer may be a micro,
mini, or large frame computer using any standard or specialized operating system such as a
Windows™ based operating system. A standard PC (personal computer) could be used to
perform the analytical methods according to embodiments of the invention.

B. Detection by Immunoassay 
[0162] An immunoassay can be used to detect and analyze markers in a sample. This 
method comprises: (a) providing an antibody that specifically binds to a marker; (b) 
contacting a sample with the antibody; and (c) detecting the presence of a complex of the 
antibody bound to the marker in the sample. 

[0163] Methods for producing polyclonal and monoclonal antibodies that react specifically 
with a cellular marker are known to those of skill in the art. See, e.g., Coligan, Current 
Protocols in Immunology (1991); Harlow & Lane, Antibodies: A Laboratory Manual (1988); 
Goding, Monoclonal Antibodies: Principles and Practice (2d ed. 1986); and Kohler & 
Milstein, Nature 256:495-497 (1975). For example, to produce polyclonal antibodies, a
purified target protein is mixed with an adjuvant and used to immunize animals. When high
titers of antibody to the target protein are obtained, blood is collected from the animals and
antisera are prepared for immunoassays. To produce monoclonal antibodies, spleen cells
from an animal immunized with a target protein are immortalized, commonly by fusion with
a myeloma cell (see, Kohler and Milstein, Eur. J. Immunol., 6:511-519, 1976). Colonies
arising from single immortalized cells are screened for production of antibodies of the desired
specificity and affinity for the target protein. 

[0164] If the markers are not known proteins in the databases, nucleic acid and amino acid 
sequences can be determined with knowledge of even a portion of the amino acid sequence of
the marker. For example, degenerate probes can be made based on the N-terminal amino
acid sequence of the marker. These probes can then be used to screen a genomic or cDNA
library created from a sample from which a marker was initially detected. The positive
clones can be identified, amplified, and their recombinant DNA sequences can be subcloned
using techniques which are well known. See, e.g., Ausubel et al., Current Protocols in
Molecular Biology, 1994 and Sambrook and Russell, supra. Based on the polynucleotide
sequence encoding a marker, antibodies against the marker can be prepared using any
suitable methods known in the art. See, e.g., Huse et al., Science 246:1275-1281 (1989);
Ward et al., Nature 341:544-546 (1989).

[0165] After the antibody is provided, a marker can be detected and/or quantified using any
of the suitable immunological binding assays known in the art (see, e.g., U.S. Patent Nos.
4,366,241; 4,376,110; 4,517,288; and 4,837,168). Useful assays include, for example, an
enzyme immune assay (EIA) such as enzyme-linked immunosorbent assay (ELISA), a
radioimmune assay (RIA), a Western blot assay, or a slot blot assay. These methods are also
described in, e.g., Methods in Cell Biology: Antibodies in Cell Biology, volume 37 (Asai, ed.
1993); Basic and Clinical Immunology (Stites & Terr, eds., 7th ed. 1991); and Harlow &
Lane, supra.

C. Diagnosis of Metastatic HCC or the Predisposition to Develop HCC 
[0166] In another aspect, the present invention provides methods for aiding a diagnosis of
the probability of developing metastatic tumors in an HCC patient or a predisposition for
developing HCC in a patient with a severe liver disease using one or more markers identified
in Tables 2-7. Although valid diagnoses can be made based on as few as one marker selected
from the markers in Tables 2-7, it is preferred that multiple markers are used to achieve more
reliable results. Preferably, at least 10 cellular markers of Table 2 should be included in the
set of markers used to predict an HCC patient's metastatic potential, for example, more
preferably at least 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, or 100, and most preferably all 153
markers of Table 2 should be included in the markers used. Similarly, preferably at least 10,
more preferably at least 15, 20, 25, 30, 40, 50, 60, 70, 80, 90, or 100, and most preferably all
273 genes of Table 5 should be included in the markers used for determining the risk of
developing HCC in a patient with a chronic liver disease. The markers identified in Tables
2-7 can be used alone, in combination with other markers in any of the Tables, or with entirely
different markers in aiding in the diagnosis of developing metastatic HCC or a predisposition
for developing HCC by a patient with a severe liver disease. The markers in Tables 2-7 are
differentially present in samples of a metastatic HCC or tissue samples of someone
predisposed for HCC relative to a non-metastatic HCC or a subject not predisposed for HCC,
respectively. For example, some of the markers are expressed at an elevated level and/or are
present at a higher frequency in metastatic HCC or tissue samples of someone predisposed
for HCC relative to patients with non-metastatic HCC or individuals at low risk for
developing HCC. Therefore, detection of one or more of these markers in a person would
provide useful information regarding the probability that the person may develop metastatic
HCC or be predisposed to develop HCC.

[0167] Accordingly, embodiments of the invention include methods for aiding in 
diagnosing the probability of developing metastatic HCC or in diagnosing the probability of 
a patient with a severe liver disease developing HCC, wherein the method comprises: (a) 
detecting at least one marker in a sample, wherein the marker is selected from the markers 
identified in Tables 2-7; and (b) correlating the detection of the marker or markers with a 
diagnosis of metastatic HCC or the probability for a liver disease patient to develop HCC. 
The correlation may take into account the amount of the marker or markers in the sample 
compared to a control amount of the marker or markers (e.g., a non-metastatic HCC or a 
subject not predisposed for HCC). The correlation may take into account the presence or 
absence of the markers in a test sample and the frequency of detection of the same markers in 
a control. The correlation may take into account both of such factors to facilitate 
determination of whether a subject has a metastatic HCC or has a severe liver disease that will 
likely lead to HCC. 

[0168] Any suitable samples can be obtained from a subject to detect markers. Preferably, 
a sample is a liver tissue sample from the subject. If desired, the sample can be prepared as 
described above to enhance detectability of the markers. 

[0169] Any suitable method can be used to detect a marker or markers in a sample. For 
example, gas phase ion spectrometry or an immunoassay can be used as described above. 
Using these methods, one or more markers can be detected. Preferably, a sample is tested for 
the presence of a plurality of markers. Detecting the presence of a plurality of markers, rather 
than a single marker alone, would provide more information for the diagnostician. 
Specifically, the detection of a plurality of markers in a sample would increase the percentage 
of true positive and true negative diagnoses and would decrease the percentage of false 
positive or false negative diagnoses. 

[0170] The detection of the marker or markers is then correlated with a probable diagnosis 
of developing metastatic HCC or a predisposition for developing HCC by a patient with a 
severe liver disease. In some embodiments, the detection of the mere presence or absence of 
a marker, without quantifying the amount of marker, is useful and can be correlated with a 
probable diagnosis of developing metastatic HCC or a predisposition for developing HCC by 
a patient with a severe liver disease. 

[0171] In other embodiments, the detection of markers can involve quantifying the markers 
to correlate the detection of markers with a probable diagnosis of developing metastatic HCC 
or a predisposition for developing HCC by a patient with severe liver disease. For example, 
increased levels of OPN are observed in patients with metastatic HCC. Thus, if the amount 
of the markers detected in a subject being tested is higher compared to a control amount, then 
the subject being tested has a higher probability of developing metastatic HCC or, in the case 
of a patient with a severe liver disease, of being predisposed to developing HCC. 
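
The comparison of a quantified marker with a control amount can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the invention: the control amount is taken as the median of comparable control samples, and the function name, the sample values, and the fold-change cutoff of 1.5 are hypothetical.

```python
from statistics import median


def is_elevated(test_amount, control_amounts, fold_cutoff=1.5):
    """Return (fold_change, flag) for a marker amount versus a control.

    The control amount is the median of the control samples.
    fold_cutoff is an illustrative threshold, not a value from the text.
    """
    control = median(control_amounts)
    if control <= 0:
        raise ValueError("control amount must be positive")
    fold_change = test_amount / control
    return fold_change, fold_change >= fold_cutoff


# Hypothetical OPN amounts in arbitrary units.
controls = [10.0, 12.0, 11.0, 9.0, 13.0]
fold, elevated = is_elevated(33.0, controls)
print(f"fold change = {fold:.1f}, elevated = {elevated}")
```

In practice the cutoff would be chosen from the variation observed in the control population rather than fixed in advance.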

[0172] When the markers are quantified, the amounts can be compared to a control. A control can be, 
e.g., the average or median amount of marker present in comparable samples of normal 
subjects not predisposed to developing metastatic HCC or not predisposed to developing 
HCC by a patient with severe liver disease. The control amount is measured under the same 
or substantially similar experimental conditions as in measuring the test amount. For 
example, if a test sample is obtained from a subject's blood serum sample and a marker is 
detected using a particular probe, then a control amount of the marker is preferably 
determined from a serum sample of a patient using the same probe. It is preferred that the 
control amount of marker is determined based upon a significant number of samples from 
normal subjects who do not have metastatic HCC or tissue samples of someone not 
predisposed for HCC so that it reflects variations of the marker amounts in that population. 

[0173] Data generated by mass spectrometry can then be analyzed by computer software. 
The software can comprise code that converts signal from the mass spectrometer into 
computer readable form. The software also can include code that applies an algorithm to the 
analysis of the signal to determine whether the signal represents a "peak" in the signal 
corresponding to a marker of this invention, or other useful markers. The software also can 
include code that executes an algorithm that compares signal from a test sample to a typical 
signal characteristic of "normal" and metastatic HCC or a predisposition for developing HCC 
by a patient with severe liver disease and determines the closeness of fit between the two 
signals. The software also can include code indicating which reference signal the test sample 
is closest to, thereby providing a probable diagnosis. 
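
One way such a closeness-of-fit comparison could be coded is sketched below. The text does not specify the algorithm; this example assumes that each signal has already been reduced to a vector of peak intensities of equal length and uses Euclidean distance as the measure of fit. The function names and the intensity values are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Distance between two equal-length vectors of peak intensities."""
    if len(a) != len(b):
        raise ValueError("signals must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_reference(test_signal, references):
    """Return (label, distance) of the reference signal nearest the test signal."""
    return min(
        ((label, euclidean_distance(test_signal, ref))
         for label, ref in references.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical typical signals (intensities at four marker peaks).
references = {
    "normal": [1.0, 1.0, 1.0, 1.0],
    "metastatic HCC": [4.0, 1.0, 3.0, 1.0],
}
test_signal = [3.5, 1.2, 2.8, 0.9]
label, distance = closest_reference(test_signal, references)
print(label, round(distance, 3))
```

Other measures of fit, such as a correlation coefficient, could be substituted for the distance function without changing the overall structure.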

III. Regulation of the Biological Activity of Therapeutic Targets 
[0174] Osteopontin (OPN) and EpCAM have been positively correlated to metastasis in an 
HCC patient and onset of HCC in a patient with a chronic liver disease, respectively. 
Therefore, it is one objective of this invention to identify compounds that regulate, 
particularly inhibit, the activity of OPN or EpCAM. 

A. Assays for Biological Functions 
[0175] OPN and its alleles and polymorphic variants are secreted phosphoproteins encoded 
by SEQ ID NO:1 and whose amino acid sequence is disclosed in SEQ ID NO:2. The activity 
of OPN polypeptides can be assessed using a variety of in vitro and in vivo assays to 
determine its functional, chemical, and physical effects, e.g., measuring receptor binding 
(e.g., radioactive receptor binding), and the like. Further downstream events, such as altered 
cellular events including cell proliferation, differentiation, etc. may also be used as indirect 
indicators of modified OPN activity. In addition, such assays can be used to test and screen 
for antagonists of OPN activity. Antagonists can also be genetically altered versions of OPN, 
e.g., a dominant negative version of the protein. Such antagonists of OPN activity are useful 
for treating metastatic HCC. 

[0176] The OPN of the assay will be selected from a polypeptide having a sequence of 
SEQ ID NO:2 or a conservatively modified variant or fragment thereof. Generally, the 
amino acid sequence identity will be at least 70%, optionally at least 85%, optionally at least 
90-95%. Optionally, the polypeptide of the assays will comprise a domain of OPN, such as a 
receptor binding domain, an extracellular matrix binding domain, and the like. Either OPN 
or a domain thereof can be covalently linked to a heterologous protein to create a chimeric 
protein used in the assays described herein. 

[0177] Modulators of OPN activity are tested using OPN polypeptides as described above, 
either recombinant or naturally occurring. The protein can be isolated, expressed in a cell, 
secreted from a cell, expressed in tissue or in an animal, either recombinant or naturally 
occurring. For example, liver slices, dissociated liver cells, or transformed cells can be used. 
OPN antagonism is tested using one of the in vitro or in vivo assays described herein. 
Furthermore, receptor-binding domains of the OPN protein can be used in vitro in soluble or 
solid state reactions to assay for receptor binding. 

[0178] Receptor binding to OPN, a domain, or chimeric protein can be tested in solution, in 
a bilayer membrane, attached to a solid phase, in a lipid monolayer, or in vesicles. Binding 
of an antagonist can be tested using, e.g., changes in spectroscopic characteristics (e.g., 
fluorescence, absorbance, refractive index), hydrodynamic (e.g., shape), chromatographic, or 
solubility properties. 

[0179] Samples or assays that are treated with a potential OPN inhibitor are compared to 
control samples without the test compound, to examine the extent of antagonism. Control 
samples (untreated with inhibitors) are assigned a relative OPN activity value of 100%. 
Antagonism of OPN is achieved when the OPN activity value relative to the control is about 
90%, optionally 50%, optionally 25-0%. 
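
The relative activity calculation can be illustrated with a short sketch. It assumes that the assay yields a single numeric readout per sample and that "antagonism" means a relative activity at or below the chosen cutoff (90% by default here); that reading of the thresholds, the function names, and the readout values are assumptions made for illustration.

```python
def relative_activity(treated_signal, control_signal):
    """Express treated-sample activity as a percentage of the untreated control."""
    if control_signal <= 0:
        raise ValueError("control signal must be positive")
    return 100.0 * treated_signal / control_signal


def is_antagonist(treated_signal, control_signal, cutoff_percent=90.0):
    """True when relative activity is at or below the chosen cutoff."""
    return relative_activity(treated_signal, control_signal) <= cutoff_percent


control = 2000.0  # hypothetical readout for the untreated control
for name, treated in [("compound A", 1900.0),
                      ("compound B", 900.0),
                      ("compound C", 100.0)]:
    pct = relative_activity(treated, control)
    print(name, pct, is_antagonist(treated, control))
```

A stricter screen would pass `cutoff_percent=50.0` or `cutoff_percent=25.0`, matching the optional thresholds named in the text.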

[0180] Changes in OPN receptor binding may be assessed by determining changes in the 
ability of the vitronectin receptor to bind OPN in the presence of the antagonist. Generally, 
the compounds to be tested are present in the range from 1 pM to 100 mM. 

[0181] The effects of the test compounds upon the function of the polypeptides can be 
measured by examining any of the parameters described above. Any suitable physiological 
change that affects OPN activity can be used to assess the influence of a test compound on 
the polypeptides of this invention. When the functional consequences are determined using 
intact cells or animals, one can also measure a variety of effects such as transcriptional 
changes to both known and uncharacterized genetic markers (e.g., northern blots), and changes in 
cell metabolism such as cell growth or pH changes. 

[0182] Similarly, the biological functions of EpCAM may be monitored based on the same 
general principles and methodologies as described above. For instance, EpCAM is known to 
play a role in epithelial cell homotypic adhesion, relying on both its extracellular and 
intracellular domains for proper functioning. Thus, EpCAM's functions can be examined 
based on, e.g., cell aggregation, specific interactions with its known binding partners (e.g., 
with actin via its intracellular domain), and disruption of signal transduction it is known to 
mediate. Various cellular events may serve as indicators of EpCAM activity and facilitate 
screening test compounds for EpCAM antagonists. 

B. Antagonists 

[0183] The compounds tested as antagonists of OPN or EpCAM can be any small chemical 
compound, or a biological entity, such as a protein, sugar, nucleic acid or lipid. Various 
antibodies against the proteins are likely candidates for antagonists. For example, many 
monoclonal antibodies, such as 17-1A and GA733, are known to specifically bind EpCAM 
and can thus be tested in appropriate assays for their ability to interfere with EpCAM's 
biological functions. 

[0184] Alternatively, antagonists can be genetically altered versions of OPN or EpCAM, 
such as a so-called "dominant negative" version, a biologically inactive version that 
suppresses the normal function of its wild type counterpart by competing for limited binding 
partners. Typically, test compounds will be small chemical molecules and peptides. 
Essentially any chemical compound can be used as a potential antagonist in the assays of the 
invention, although most often compounds that can be dissolved in aqueous or organic (especially 
DMSO-based) solutions are used. The assays are designed to screen large chemical libraries 
by automating the assay steps and providing compounds from any convenient source to 
assays, which are typically run in parallel (e.g., in microtiter formats on microtiter plates in 
robotic assays). It will be appreciated that there are many suppliers of chemical compounds, 
including Sigma (St. Louis, MO), Aldrich (St. Louis, MO), Sigma-Aldrich (St. Louis, MO), 
Fluka Chemika-Biochemica Analytika (Buchs, Switzerland) and the like. 

[0185] In one preferred embodiment, high throughput screening methods involve providing 
a combinatorial chemical or peptide library containing a large number of potential therapeutic 
compounds (potential modulator or ligand compounds). Such "combinatorial chemical 
libraries" or "ligand libraries" are then screened in one or more assays, as described herein, to 
identify those library members (particular chemical species or subclasses) that display a 
desired characteristic activity. The compounds thus identified can serve as conventional 
"lead compounds" or can themselves be used as potential or actual therapeutics. 

[0186] A combinatorial chemical library is a collection of diverse chemical compounds 
generated by either chemical synthesis or biological synthesis, by combining a number of 
chemical "building blocks" such as reagents. For example, a linear combinatorial chemical 
library such as a polypeptide library is formed by combining a set of chemical building 
blocks (amino acids) in every possible way for a given compound length (i.e., the number of 
amino acids in a polypeptide compound). Millions of chemical compounds can be 
synthesized through such combinatorial mixing of chemical building blocks. 
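
The size of such a linear library grows as the number of building blocks raised to the compound length. The sketch below enumerates a peptide library in this way; it assumes the 20 standard amino acids as the building-block set, and the function names are illustrative only.

```python
from itertools import islice, product

# One-letter codes of the 20 standard amino acids (assumed building-block set).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def library_size(length, building_blocks=AMINO_ACIDS):
    """Number of distinct linear compounds of the given length."""
    return len(building_blocks) ** length


def enumerate_library(length, building_blocks=AMINO_ACIDS):
    """Lazily yield every sequence of the given length."""
    for combo in product(building_blocks, repeat=length):
        yield "".join(combo)


print(library_size(5))                         # size of a pentapeptide library
print(list(islice(enumerate_library(2), 3)))   # first three dipeptides
```

With 20 building blocks, a length of only five residues already gives 20^5 = 3,200,000 compounds, which is consistent with the statement that millions of compounds can be generated.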

[0187] Preparation and screening of combinatorial chemical libraries is well known to 
those of skill in the art. Such combinatorial chemical libraries include, but are not limited to, 
peptide libraries (see, e.g., U.S. Patent 5,010,175; Furka, Int. J. Pept. Prot. Res. 37:487-493, 
1991; and Houghton et al., Nature 354:84-88, 1991). Other chemistries for generating 
chemical diversity libraries can also be used. Such chemistries include, but are not limited to: 
peptoids (e.g., PCT Publication No. WO 91/19735), encoded peptides (e.g., PCT Publication 
WO 93/20242), random bio-oligomers (e.g., PCT Publication No. WO 92/00091), 
benzodiazepines (e.g., U.S. Pat. No. 5,288,514), diversomers such as hydantoins, 
benzodiazepines and dipeptides (Hobbs et al., Proc. Nat. Acad. Sci. USA 90:6909-6913, 
1993), vinylogous polypeptides (Hagihara et al., J. Amer. Chem. Soc. 114:6568, 1992), 
nonpeptidal peptidomimetics with glucose scaffolding (Hirschmann et al., J. Amer. Chem. 
Soc. 114:9217-9218, 1992), analogous organic syntheses of small compound libraries (Chen 
et al., J. Amer. Chem. Soc. 116:2661, 1994), oligocarbamates (Cho et al., Science 261:1303, 
1993), and/or peptidyl phosphonates (Campbell et al., J. Org. Chem. 59:658, 1994), nucleic 
acid libraries (see Ausubel, Berger and Sambrook, all supra), peptide nucleic acid libraries 
(see, e.g., U.S. Patent 5,539,083), antibody libraries (see, e.g., Vaughn et al., Nature 
Biotechnology, 14(3):309-314, 1996 and PCT/US96/10287), carbohydrate libraries (see, e.g., 
Liang et al., Science 274:1520-1522, 1996 and U.S. Patent 5,593,853), small organic 
molecule libraries (see, e.g., benzodiazepines, Baum C&EN, Jan 18, page 33, 1993; 
isoprenoids, U.S. Patent 5,569,588; thiazolidinones and metathiazanones, U.S. Patent 
5,549,974; pyrrolidines, U.S. Patents 5,525,735 and 5,519,134; morpholino compounds, U.S. 
Patent 5,506,337; benzodiazepines, U.S. Patent 5,288,514, and the like). 

[0188] Devices for the preparation of combinatorial libraries are commercially available 
(see, e.g., 357 MPS, 390 MPS, Advanced Chem Tech, Louisville KY, Symphony, Rainin, 
Woburn, MA, 433A Applied Biosystems, Foster City, CA, 9050 Plus, Millipore, Bedford, 
MA). In addition, numerous combinatorial libraries are themselves commercially available 
(see, e.g., ComGenex, Princeton, N.J., Tripos, Inc., St. Louis, MO, 3D Pharmaceuticals, 
Exton, PA, Martek Biosciences, Columbia, MD, etc.). 

C. Solid State and soluble high throughput assays 
[0189] In one embodiment the invention provides soluble assays using molecules such as a 
domain such as a receptor binding domain, an extracellular matrix binding domain, etc.; a 
domain that is covalently linked to a heterologous protein to create a chimeric molecule; OPN 
or EpCAM; or a cell or tissue expressing OPN or EpCAM, either naturally occurring or 
recombinant. In another embodiment, the invention provides solid phase based in vitro 
assays in a high throughput format, where the domain, chimeric molecule, OPN or EpCAM, 
or cell or tissue expressing OPN or EpCAM is attached to a solid phase substrate. 

[0190] In the high throughput assays of the invention, it is possible to screen up to several 
thousand different antagonists or ligands in a single day. In particular, each well of a 
microtiter plate can be used to run a separate assay against a selected potential modulator, or, 
if concentration or incubation time effects are to be observed, every 5-10 wells can test a 
single modulator. Thus, a single standard microtiter plate can assay about 100 (e.g., 96) 
modulators. If 1536 well plates are used, then a single plate can easily assay from about 100 
to about 1500 different compounds. It is possible to assay several different plates per day; assay 
screens for up to about 6,000-20,000 different compounds are possible using the integrated 
systems of the invention. More recently, microfluidic approaches to reagent manipulation 
have been developed, e.g., by Caliper Technologies (Palo Alto, CA). 
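
The throughput figures above follow from simple arithmetic, sketched here. The plate formats (96 and 1536 wells) and the 5-10 wells per modulator come from the text; the numbers of plates per day (4 and 13) are hypothetical values chosen to bracket the stated range, and the function names are illustrative.

```python
def modulators_per_plate(wells_per_plate, wells_per_modulator=1):
    """Number of modulators one plate can assay (whole modulators only)."""
    return wells_per_plate // wells_per_modulator


def compounds_per_day(plates_per_day, wells_per_plate, wells_per_modulator=1):
    """Total compounds screened per day across all plates."""
    return plates_per_day * modulators_per_plate(wells_per_plate, wells_per_modulator)


print(modulators_per_plate(96))        # one modulator per well, standard plate
print(modulators_per_plate(1536, 10))  # 10 wells per modulator, 1536-well plate
print(compounds_per_day(4, 1536))      # hypothetical 4 plates per day
print(compounds_per_day(13, 1536))     # hypothetical 13 plates per day
```

Under these assumptions a 1536-well plate assays between 153 and 1536 modulators, and 4 to 13 such plates per day give 6,144 to 19,968 compounds, in line with the approximate ranges stated.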

[0191] The molecule of interest can be bound to the solid state component, directly or 
indirectly, via covalent or non-covalent linkage, e.g., via a tag. The tag can be any of a variety 
of components. In general, a molecule which binds the tag (a tag binder) is fixed to a solid 
support, and the tagged molecule of interest (e.g., the signal transduction molecule of 
interest) is attached to the solid support by interaction of the tag and the tag binder. 

[0192] A number of tags and tag binders can be used, based upon known molecular 
interactions well described in the literature. For example, where a tag has a natural binder, 
for example, biotin, protein A, or protein G, it can be used in conjunction with appropriate tag 
binders (avidin, streptavidin, neutravidin, the Fc region of an immunoglobulin, etc.). 
Antibodies to molecules with natural binders such as biotin are also widely available and 
appropriate tag binders (see SIGMA Immunochemicals 1998 catalogue, SIGMA, St. Louis, 
MO). 

[0193] Similarly, any haptenic or antigenic compound can be used in combination with an 
appropriate antibody to form a tag/tag binder pair. Thousands of specific antibodies are 
commercially available and many additional antibodies are described in the literature. For 
example, in one common configuration, the tag is a first antibody and the tag binder is a 
second antibody which recognizes the first antibody. In addition to antibody-antigen 
interactions, receptor-ligand interactions are also appropriate as tag and tag-binder pairs. 
Examples include agonists and antagonists of cell membrane receptors (e.g., cell receptor-ligand 
interactions such as transferrin, c-kit, viral receptor ligands, cytokine receptors, chemokine 
receptors, interleukin receptors, immunoglobulin receptors and antibodies, the cadherin 
family, the integrin family, the selectin family, and the like; see, e.g., Pigott & Power, The 
Adhesion Molecule Facts Book I (1993)). Similarly, toxins and venoms, viral epitopes, 
hormones (e.g., opiates, steroids, etc.), intracellular receptors (e.g., which mediate the effects 
of various small ligands, including steroids, thyroid hormone, retinoids and vitamin D; 
peptides), drugs, lectins, sugars, nucleic acids (linear or cyclic polymer configurations), 
oligosaccharides, proteins, phospholipids, and antibodies can all interact with various cell 
receptors. 

[0194] Synthetic polymers, such as polyurethanes, polyesters, polycarbonates, polyureas, 
polyamides, polyethyleneimines, polyarylene sulfides, polysiloxanes, polyimides, and 
polyacetates can also form an appropriate tag or tag binder. Many other tag/tag binder pairs 
are also useful in assay systems described herein, as would be apparent to one of skill upon 
review of this disclosure. 

[0195] Common linkers such as peptides, polyethers, and the like can also serve as tags, 
and include polypeptide sequences, such as poly gly sequences of between about 5 and 200 
amino acids. Such flexible linkers are known to persons of skill in the art. For example, 
poly(ethylene glycol) linkers are available from Shearwater Polymers, Inc., Huntsville, 
Alabama. These linkers optionally have amide linkages, sulfhydryl linkages, or 
heterofunctional linkages. 

[0196] Tag binders are fixed to solid substrates using any of a variety of methods currently 
available. Solid substrates are commonly derivatized or functionalized by exposing all or a 
portion of the substrate to a chemical reagent which fixes a chemical group to the surface 
which is reactive with a portion of the tag binder. For example, groups which are suitable for 
attachment to a longer chain portion would include amines, hydroxyl, thiol, and carboxyl 
groups. Aminoalkylsilanes and hydroxyalkylsilanes can be used to functionalize a variety of 
surfaces, such as glass surfaces. The construction of such solid phase biopolymer arrays is 
well described in the literature. See, e.g., Merrifield, J. Am. Chem. Soc. 85:2149-2154 (1963) 
(describing solid phase synthesis of, e.g., peptides); Geysen et al., J. Immun. Meth. 102:259- 
274 (1987) (describing synthesis of solid phase components on pins); Frank & Doring, 
Tetrahedron 44:6031-6040 (1988) (describing synthesis of various peptide sequences on 
cellulose disks); Fodor et al., Science, 251:767-777 (1991); Sheldon et al., Clinical Chemistry 
39(4):718-719 (1993); and Kozal et al., Nature Medicine 2(7):753-759 (1996) (all describing 
arrays of biopolymers fixed to solid substrates). Non-chemical approaches for fixing tag 
binders to substrates include other common methods, such as heat, cross-linking by UV 
radiation, and the like. 

D. Computer-based assays 
[0197] Yet another approach to screen for compounds that modulate OPN or EpCAM 
activity involves computer assisted drug design, in which a computer system is used to 
generate a three-dimensional structure of OPN or EpCAM based on the structural information 
encoded by the amino acid sequence. The input amino acid sequence interacts directly and 
actively with a pre-established algorithm in a computer program to yield secondary, tertiary, 
and quaternary structural models of the protein. The models of the protein structure are then 
examined to identify regions of the structure that have the ability to bind, e.g., ligands. These 
regions are then used to identify ligands that bind to the protein. 

[0198] The three-dimensional structural model of the protein is generated by entering 
protein amino acid sequences of at least 10 amino acid residues or corresponding nucleic acid 
sequences encoding an OPN or EpCAM polypeptide into the computer system. For example, 
the amino acid sequence of an OPN polypeptide or the nucleic acid encoding the polypeptide 
is selected from the group consisting of SEQ ID NOS:1 or 2, and conservatively modified 
versions thereof. The amino acid sequence represents the primary sequence or subsequence 
of the protein, which encodes the structural information of the protein. At least 10 residues of 
the amino acid sequence (or a nucleotide sequence encoding 10 amino acids) are entered into 
the computer system from computer keyboards, computer readable substrates that include, 
but are not limited to, electronic storage media (e.g., magnetic diskettes, tapes, cartridges, and 
chips), optical media (e.g., CD ROM), information distributed by internet sites, and by RAM. 
The three-dimensional structural model of the protein is then generated by the interaction of 
the amino acid sequence and the computer system, using software known to those of skill in 
the art. 

[0199] The amino acid sequence represents a primary structure that encodes the 
information necessary to form the secondary, tertiary and quaternary structure of the protein 
of interest. The software looks at certain parameters encoded by the primary sequence to 
generate the structural model. These parameters are referred to as "energy terms," and 
primarily include electrostatic potentials, hydrophobic potentials, solvent accessible surfaces, 
and hydrogen bonding. Secondary energy terms include van der Waals potentials. 
Biological molecules form the structures that minimize the energy terms in a cumulative 
fashion. The computer program is therefore using these terms encoded by the primary 
structure or amino acid sequence to create the secondary structural model. 

[0200] The tertiary structure of the protein encoded by the secondary structure is then 
formed on the basis of the energy terms of the secondary structure. The user at this point can 
enter additional variables such as whether the protein is membrane bound or soluble, its 
location in the body, and its cellular location, e.g., cytoplasmic, surface, or nuclear. These 
variables along with the energy terms of the secondary structure are used to form the model 
of the tertiary structure. In modeling the tertiary structure, the computer program matches 
hydrophobic faces of secondary structure with like, and hydrophilic faces of secondary 
structure with like. 

[0201] Once the structure has been generated, potential ligand binding regions are 
identified by the computer system. Three-dimensional structures for potential ligands are 
generated by entering amino acid or nucleotide sequences or chemical formulas of 
compounds, as described above. The three-dimensional structure of the potential ligand is 
then compared to that of the OPN or EpCAM protein to identify ligands that bind to OPN or 
EpCAM. Binding affinity between the protein and ligands is determined using energy terms 
to determine which ligands have an enhanced probability of binding to the protein. 

[0202] Computer systems are also used to screen for mutations, polymorphic variants, 
alleles and interspecies homologs of OPN genes or EpCAM genes. Such mutations can be 
associated with disease states or genetic traits. As described above, GENECHIP® and 
related technology can also be used to screen for mutations, polymorphic variants, alleles, 
and interspecies homologs. Once the variants are identified, diagnostic assays can be used to 
identify patients having such mutated genes. Identification of the mutated OPN genes, for 
example, involves receiving input of a first amino acid or nucleic acid sequence encoding 
OPN, selected from the group consisting of SEQ ID NOS:1 and 2, and conservatively 
modified versions thereof. The sequence is entered into the computer system as described 
above. The first nucleic acid or amino acid sequence is then compared to a second nucleic 
acid or amino acid sequence that has substantial identity to the first sequence. The second 
sequence is entered into the computer system in the manner described above. Once the first 
and second sequences are compared, nucleotide or amino acid differences between the 
sequences are identified. Such differences can represent allelic differences in OPN genes, and 
mutations associated with disease states and genetic traits. The same general strategy is also 
applicable for detecting EpCAM variants and mutants. 
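
The comparison of a first and a second sequence can be sketched as a position-by-position scan. This minimal example assumes the two sequences have already been aligned to equal length (a real comparison of sequences with insertions or deletions would first require an alignment step); the function name and the short sequences are placeholders and are not SEQ ID NO:1 or 2.

```python
def sequence_differences(first, second):
    """List (position, first_residue, second_residue) where two aligned
    sequences of equal length differ. Positions are 1-based."""
    if len(first) != len(second):
        raise ValueError("sequences must be pre-aligned to equal length")
    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(first.upper(), second.upper()), start=1)
        if a != b
    ]


# Placeholder nucleotide sequences for illustration only.
first = "ATGAGAATTGCAGTG"
second = "ATGAGGATTGCACTG"
for pos, a, b in sequence_differences(first, second):
    print(f"position {pos}: {a} -> {b}")
```

The same function works unchanged on amino acid sequences written in one-letter code.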

E. Kits 

[0203] A protein of interest and its homologs are useful tools for identifying its 
antagonists. For instance, OPN-specific reagents that specifically hybridize to OPN nucleic 
acid, such as OPN probes and primers, and OPN-specific reagents that specifically bind to the 
OPN protein, e.g., OPN antibodies, are used to examine liver cell expression and signal 
transduction regulation and to diagnose metastatic HCC. The same general methods are 
applicable to EpCAM as well. 

[0204] Nucleic acid assays for the presence and the quantity of OPN or EpCAM 
polynucleotides in a sample include numerous techniques well known to those skilled in the 
art, such as Southern blot analysis, northern blot analysis, dot blots, RNase protection, S1 
analysis, amplification techniques such as PCR (including RT-PCR) and LCR, and in situ 
hybridization. In in situ hybridization, for example, the target nucleic acid, e.g., nucleic acid 
encoding OPN, is liberated from its cellular surroundings in such a way as to be available for 
hybridization within the cell while preserving the cellular morphology for subsequent 
interpretation and analysis (see Example 1). The following articles provide an overview of 
the art of in situ hybridization: Singer et al., Biotechniques 4:230-250 (1986); Haase et al., 
Methods in Virology, vol. VII, pp. 189-226 (1984); and Nucleic Acid Hybridization: A 
Practical Approach (Hames et al., eds. 1987). In addition, OPN or EpCAM protein can be 
detected with the various immunoassay techniques described above. The test sample is 
typically compared to both a positive control (e.g., a sample containing recombinant OPN or 
EpCAM) and a negative control. 

[0205] The present invention also provides for kits for screening for modulators of OPN or 
EpCAM. Such kits can be prepared from readily available materials and reagents. For 
example, such kits can comprise any one or more of the following materials: OPN (or 
EpCAM), reaction tubes, and instructions for testing OPN (or EpCAM) activity. Optionally, 
the kit contains biologically active OPN (or EpCAM). A wide variety of kits and 
components can be prepared according to the present invention, depending upon the intended 
user of the kit and the particular needs of the user. 

IV. Inhibition of the Expression of Therapeutic Targets 

[0206] Another means of inhibiting OPN activity and thereby inhibiting HCC metastasis in 
an HCC patient is to inhibit OPN expression. Similarly, reduced risk of developing HCC in a 
patient with a chronic liver disease may be achieved by inhibiting EpCAM expression. A 
variety of methods well known to those skilled in the art are available for specifically 
suppressing the expression of a particular gene. 

A. Antisense polynucleotides 
[0207] Antisense technology has been the most commonly described approach in protocols 
to achieve gene-specific inactivation, and antisense polynucleotides are useful tools in 
research and diagnostics. For instance, antisense oligonucleotides capable of inhibiting gene 
expression with a high level of specificity are often used by those of ordinary skill in 
biological sciences to elucidate the function of particular genes. 

[0208] The specificity and sensitivity of antisense polynucleotides also make them suitable 
for therapeutic uses. A large number of U.S. patents and scientific publications relate to the 
use of antisense oligonucleotides as therapeutic agents in the treatment of diseases in animals 
and humans. See, e.g., U.S. Patent Nos. 6,080,580; 6,180,403; 6,255,111; 6,306,655; 
6,440,739; and 6,524,854. An antisense oligonucleotide contains a sequence complementary 
to the coding strand of a gene targeted for inactivation (e.g., SEQ ID NO:1 or SEQ ID NO:5), 
may be of varying lengths, e.g., from less than 10 nucleotides to more than 100 
nucleotides, and can be safely and effectively administered to a subject, e.g., a human. An 
antisense polynucleotide may be an oligomer or a polymer of ribonucleic acid (RNA) or 
deoxyribonucleic acid (DNA) or mimetics thereof. It may be composed of naturally-occurring 
nucleobases, sugars and covalent internucleoside (backbone) linkages as well as 
oligonucleotides having non-naturally-occurring portions that function similarly. Such 
modified or substituted antisense oligonucleotides are often preferred over native forms 
because of desirable properties such as, e.g., enhanced cellular uptake, enhanced affinity for 
nucleic acid target, and increased stability in the presence of nucleases. Antisense 
oligonucleotides suitable for the present invention may also include oligonucleotides 
containing modified backbones or non-natural internucleoside linkages. Preferred modified 
oligonucleotide backbones include, for example, phosphorothioates, chiral 
phosphorothioates, phosphorodithioates, phosphotriesters, aminoalkylphosphotriesters, 
methyl and other alkyl phosphonates including 3'-alkylene phosphonates and chiral 
phosphonates, phosphinates, phosphoramidates including 3'-amino phosphoramidate and 
aminoalkylphosphoramidates, thionophosphoramidates, thionoalkylphosphonates, 
thionoalkylphosphotriesters, and boranophosphates having normal 3'-5' linkages, 2'-5' linked 
analogs of these, and those having inverted polarity wherein the adjacent pairs of nucleoside 
units are linked 3'-5' to 5'-3' or 2'-5' to 5'-2'. Various salts, mixed salts and free acid forms are 
also included. 

[0209] Furthermore, antisense oligonucleotides suitable for the present invention may
correspond to either the coding region or the non-coding region of a target nucleic acid, e.g.,
OPN or EpCAM.
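
The complementarity requirement described above can be illustrated with a minimal Python sketch that derives a DNA antisense oligonucleotide as the reverse complement of a window of a coding-strand sequence. The function name, the window position and length, and the target sequence are hypothetical placeholders chosen for illustration; the sequence is not taken from SEQ ID NO:1 or SEQ ID NO:5.

```python
# Watson-Crick complements for the four DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def antisense_oligo(coding_strand: str, start: int, length: int) -> str:
    """Return the antisense DNA oligo (written 5'->3') for a coding-strand window."""
    window = coding_strand.upper()[start:start + length]
    if len(window) != length:
        raise ValueError("window extends past the end of the sequence")
    # Reverse the window, then complement each base.
    return "".join(COMPLEMENT[base] for base in reversed(window))


# Hypothetical coding-strand fragment, for illustration only.
target = "ATGGCTAGCTTGACCGGTAAC"
print(antisense_oligo(target, start=0, length=12))
```

Applying the function twice to the same window returns the original window, which is a quick sanity check on the base-pairing table.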

B. Ribozymes 

[0210] The level of mRNA encoded by a gene of interest, e.g., OPN or EpCAM, can also
be reduced using ribozymes. Ribozymes are RNA molecules having an enzymatic activity
that is capable of cleaving or splicing other separate RNA molecules in a nucleotide
sequence-specific manner. A ribozyme useful for practicing the present invention is a catalytic or
enzymatic RNA molecule with complementarity in a substrate binding region to a specific
RNA target, e.g., OPN or EpCAM mRNA, and also has enzymatic activity that is active to
cleave and/or splice RNA in that target, thereby inhibiting the expression of the target gene.
Methods for designing and using ribozymes to target a particular gene are known to those of
skill in the art and described in numerous publications, including U.S. Patent Nos. 6,069,007;
6,107,027; 6,225,291; 6,307,041; 6,482,803; and 6,489,163.

C. Small inhibitory RNA (siRNA)
[0211] Another useful tool to reduce the level of a target mRNA and thus the level of a
target protein is small inhibitory RNA (siRNA). siRNA molecules are small double-stranded
RNA molecules that elicit a process known as RNA interference, a form of sequence-specific
gene inactivation. A proposed mechanism for RNA interference hypothesizes an
ATP-dependent cleavage of mRNA molecules activated by a short double-stranded RNA, which is
formed between the mRNA and the antisense strand of siRNA. Zamore et al., Cell 101:25-33,
2000. RNA interference has been shown in mammalian cell lines, oocytes, early
embryos, and some cell types. See, e.g., Elbashir, Sayda M., et al., Nature 411:494-497,
2001. siRNA coding sequences can be designed based on the sequence of a target gene (e.g.,
OPN or EpCAM) and inserted into various suitable vectors, such as a plasmid or a viral
vector, with properly placed transcription initiation and termination elements. When used in
an intended recipient of eukaryotic origin, eukaryotic transcription control elements should be
used. The vectors containing siRNA coding sequences can then be delivered to a desired
target in accordance with the general methodologies for gene transfer known to those of skill
in the art. RNA interference thus provides an alternative means to specifically inhibit the
expression of a gene based on its sequence, by causing the rapid degradation of the mRNA of
the gene, e.g., OPN or EpCAM.

D. Detection of Reduced Target Gene Expression 
[0212] Following the administration of a therapeutic compound containing an agent
capable of inhibiting the expression of a target gene, e.g., OPN or EpCAM, the effectiveness
of the therapeutic compound can be assessed by comparing the in vivo expression level of the
target gene before and after the administration. The general methods for administering a
pharmaceutical compound are described in detail in a later section.

[0213] When the inhibition of gene expression is achieved at the transcriptional level, i.e., by
reduction of the amount of mRNA encoding a target gene, the diminished expression of the
target gene may be confirmed using various detection techniques such as Northern blot
assays, dot blot, RT-PCR and the like by comparing the mRNA level of the target gene (e.g.,
OPN or EpCAM) before and after the administration of a therapeutic compound. The general
methodologies for performing such analysis are well known to those of ordinary skill in the
art and described in various literature (see, e.g., Sambrook and Russell, supra and Ausubel et
al., supra).

[0214] When the inhibition of gene expression is achieved at the translational level, i.e., by
reduction of the amount of protein encoded by a target gene, the diminished expression of the
target gene may be confirmed by comparing the protein level of the target gene (e.g., OPN or
EpCAM) before and after the administration of a therapeutic compound. Various means
of measuring protein levels in tissue samples are well known to ordinarily skilled artisans.
As mentioned above, various immunoassays are routinely used to detect the presence and 
quantity of a protein of interest, e.g., OPN or EpCAM. A general overview of the applicable 
technology can be found in Harlow and Lane, Antibodies, A Laboratory Manual, 1988. 

[0215] Appropriate antibodies for target proteins, e.g., OPN and EpCAM, will be necessary 
for immunoassays. The general methods for preparing antibodies specific for a target protein 
are well known in the art and described in an earlier section. Further, some antibodies with 
desirable specificity may already be available for immunoassays (e.g., various mAb for 
EpCAM). 

[0216] Once antibodies specific for a target protein, e.g., OPN or EpCAM, are available,
the level of the target protein in a patient can be measured by a variety of immunoassay methods
with qualitative and quantitative results available to the clinician. Various samples from the
patient, such as blood or liver tissue, can be used in the immunoassays to detect the in vivo
target protein level according to the general methods described in an earlier section. For a
review of immunological and immunoassay procedures in general see, e.g., Stites, supra; U.S.
Patent Nos. 4,366,241; 4,376,110; 4,517,288; and 4,837,168.

V. Administration of Agents Inhibiting Target Protein Activity and Pharmaceutical 
Compositions 

[0217] Agents that inhibit the activity of a target protein, e.g., OPN or EpCAM, can be
administered directly to the human patient for modulation of the target protein activity in
vivo. Administration is by any of the routes normally used for introducing an antagonist or
inhibitor compound into ultimate contact with the tissue to be treated, optionally using the
tongue or mouth. The antagonists or inhibitors are administered in any suitable manner,
optionally with pharmaceutically acceptable carriers. Suitable methods of administering such
antagonists or inhibitors are available and well known to those of skill in the art, and,
although more than one route can be used to administer a particular composition, a particular
route can often provide a more immediate and more effective reaction than another route.

[0218] Pharmaceutically acceptable carriers are determined in part by the particular
composition being administered, as well as by the particular method used to administer the
composition. Accordingly, there is a wide variety of suitable formulations of pharmaceutical
compositions of the present invention (see, e.g., Remington's Pharmaceutical Sciences, 17th
ed., 1985).

[0219] The antagonists or inhibitors, alone or in combination with other suitable
components, can be made into aerosol formulations (i.e., they can be "nebulized") to be
administered via inhalation. Aerosol formulations can be placed into pressurized acceptable
propellants, such as dichlorodifluoromethane, propane, nitrogen, and the like.

[0220] Formulations suitable for administration include aqueous and non-aqueous
solutions, isotonic sterile solutions, which can contain antioxidants, buffers, bacteriostats, and
solutes that render the formulation isotonic, and aqueous and non-aqueous sterile suspensions
that can include suspending agents, solubilizers, thickening agents, stabilizers, and
preservatives. In the practice of this invention, compositions can be administered, for
example, orally, topically, intravenously, intraperitoneally, intravesically or intrathecally.
Optionally, the compositions are administered orally or nasally. The formulations of
compounds can be presented in unit-dose or multi-dose sealed containers, such as ampules
and vials. Solutions and suspensions can be prepared from sterile powders, granules, and
tablets of the kind previously described. The modulators can also be administered as part of a
prepared food or drug.

[0221] The dose administered to a patient, in the context of the present invention, should be
sufficient to effect a beneficial response in the subject over time. The dose will be
determined by the efficacy of the particular signal modulators employed and the condition of
the subject, as well as the body weight or surface area of the area to be treated. The size of
the dose also will be determined by the existence, nature, and extent of any adverse
side-effects that accompany the administration of a particular compound or vector in a particular
subject.

[0222] In determining the effective amount of an antagonist or inhibitor to be administered,
a physician may evaluate circulating plasma levels of the agent, its toxicities, and the
production of antibodies against the agent. In general, the dose equivalent of an antagonist or
inhibitor is from about 1 ng/kg to 10 mg/kg for a typical subject.

[0223] For administration, antagonists or inhibitors of the present invention can be
administered at a rate determined by the LD-50 of the antagonist, and the side-effects of the
inhibitor at various concentrations, as applied to the mass and overall health of the subject.
Administration can be accomplished via single or divided doses.

VI. Examples 

[0224] It is understood that the examples and embodiments described herein are for 
illustrative purposes only and that various modifications or changes in light thereof will be 
suggested to persons skilled in the art and are to be included within the spirit and purview of 
this application and scope of the appended claims. All publications, patents, and patent 
applications cited herein are hereby incorporated by reference in their entirety for all 
purposes without limitation. 

A. Example 1: Predicting a predisposition for Hepatocellular Carcinoma 
metastasis 

1. MATERIALS AND METHODS

a) Patients and tissue samples. 
[0225] All of the HCC samples were obtained with informed consent from patients who
underwent curative resection at the Liver Cancer Institute, Zhongshan Hospital of Fudan
University in China. A total of 107 paired primary HCC, metastatic HCC, and adjacent non- 
tumor normal liver tissue samples were obtained from 40 patients who were pathologically 
diagnosed as HCC and underwent hepatectomy at the Liver Cancer Institute, Zhongshan 
Hospital of Fudan University (formerly Shanghai Medical University) in China. Prior to 
surgery, each patient was examined by computer tomography of abdomen and chest X-ray, 
and some patients also were examined by isotope scanning of bone if necessary. Among the 
107 paired samples, 81 were from 27 patients who had primary HCC, corresponding adjacent 
non-tumor liver tissue and metastatic HCC [15 with intra-hepatic spreads (group P) and 12 
with tumor thrombus in branch of portal vein (group PT)], and 26 were from 13 patients who 
had only a single primary HCC and corresponding non-tumor liver tissue (without detectable 
metastasis at the time of surgery). Tumors and non-tumor tissues were grossly dissected, 
snap-frozen in liquid nitrogen immediately after removal, and stored at -70°C until use. We 
confirmed microscopically that tumor tissue samples and their metastases consisted mostly of 
carcinoma cells and that non-tumor adjacent liver samples did not exhibit any tumor cell 
invasion. Of the 40 patients, 39 were male and one was female. Patients' ages ranged from
36 years to 74 years, with a median age of 50 years. The size of the primary HCC ranged
from 1.3 cm to 17.5 cm in diameter with a median diameter of 7.2 cm; 65% (26/40)
were > 5 cm in diameter and the remainder were < 5 cm in diameter. Thirty-two cases (80%)
had co-existing liver cirrhosis. Serologically, all but one of the 40 patients
were HBV-positive, and none was HCV-positive. Twenty-seven patients (68%) had an
elevated serum concentration of alpha-fetoprotein (AFP) (>20 ng/ml).
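
The sample accounting given in this paragraph can be checked with a short Python sketch. All counts are taken from the text above; the variable names are illustrative only.

```python
# Patient groups as described in the text.
group_p = 15    # primary HCC with intra-hepatic spread
group_pt = 12   # primary HCC with tumor thrombus in a portal vein branch
group_pn = 13   # single primary HCC, no detectable metastasis at surgery

# Metastatic patients contribute primary tumor, metastasis, and non-tumor liver;
# metastasis-free patients contribute primary tumor and non-tumor liver only.
samples_metastatic = (group_p + group_pt) * 3
samples_metastasis_free = group_pn * 2

total_patients = group_p + group_pt + group_pn
total_samples = samples_metastatic + samples_metastasis_free

print(total_patients, samples_metastatic, samples_metastasis_free, total_samples)
```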

b) RNA preparation, cDNA Microarrays and Hybridization. 
[0226] Total RNA was extracted from each sample using TRIzol Reagent (Life
Technologies, Inc.) according to the manufacturer's specifications. The cDNA microarrays
were fabricated at the Advanced Technology Center, NCI. Each array contains 9180 cDNA
clones with 7102 "named" genes, 1179 EST clones, and 122 Incyte clones. Preparation of
fluorescent cDNA targets by a direct labeling approach and the cDNA microarray
hybridization were essentially as described by Wu et al., Oncogene 20:3674-3682, 2001.
Briefly, the fluorescent targets were prepared as follows: 100 µg of total RNA from
non-cancerous liver tissue was labeled with Cy3-conjugated deoxynucleotides, and 200 µg of total
RNA from primary HCC or metastasis was labeled with Cy5-conjugated deoxynucleotides
(Amersham), by oligo dT-primed polymerization using SuperScript II reverse
transcriptase (Life Technologies). The targets were then mixed together, added to the
microarrays, and incubated overnight (12-16 hours) at 42°C. Prior to hybridization, each
microarray was pre-hybridized at 42°C for at least one hour in pre-hybridization buffer
containing 5x SSC, 0.1% SDS and 1% BSA. The slides were washed at room temperature in
2x SSC with 0.1% SDS, in 1x SSC, and in 0.2x SSC for 2 min each, and then washed
in 0.05x SSC for 1 min. Most samples, where indicated, were analyzed in duplicate. The
Cy3 and Cy5 fluorescent intensities for each clone were determined with the Axon GenePix
4000 scanner and were analyzed with the GenePix Pro 3.0 software to subtract the background
signals. The expression data were then filtered based on their channel intensities, spot size
and flag, and the Cy5/Cy3 ratios were calculated and normalized by median-centering the
log-ratio of all genes in each array.
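
The normalization step described above (median-centering the log-ratios within one array) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the text does not specify the logarithm base, so base 2 is assumed here; the intensities are hypothetical background-subtracted values; and the function name is not part of GenePix Pro or any other software named in the text.

```python
import math
from statistics import median


def median_centered_log_ratios(cy5, cy3):
    """Return log2(Cy5/Cy3) per spot, shifted so that the array median is zero."""
    # Assumes every intensity is positive after background subtraction and filtering.
    log_ratios = [math.log2(red / green) for red, green in zip(cy5, cy3)]
    center = median(log_ratios)
    return [value - center for value in log_ratios]


# Hypothetical background-subtracted intensities for five spots on one array.
cy5 = [400.0, 800.0, 1600.0, 200.0, 3200.0]
cy3 = [200.0, 200.0, 200.0, 200.0, 200.0]

normalized = median_centered_log_ratios(cy5, cy3)
print(normalized)
```

After centering, a value of zero means the spot sits at the median ratio of its own array, which makes log-ratios comparable across arrays with different overall labeling efficiency.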

c) Data Analysis and Statistical Analysis. 

[0227] Unsupervised hierarchical clustering analysis was done with the CLUSTER and
TREEVIEW software using median-centered correlation and complete linkage (Eisen et al.,
supra). We also used the BRB-ArrayTools software, an integrated package for the
visualization and statistical analysis of cDNA microarray gene expression data developed by
the Biometric Research Branch of the National Cancer Institute, for both unsupervised and
supervised analyses. The Class Comparison Tool based on univariate F-tests was used to
find genes differentially expressed between predefined clinical groups at a significance level
of P<0.001 or 0.002. The permutation distribution of the F-statistic, based on 2000 random
permutations, was also used to confirm statistical significance. In comparing primary to
metastatic tumors of the same patient, a paired-value t-statistic was used in the same manner.
The multivariate Compound Covariate Predictor (CCP) Tool with a "leave-one-out"
cross-validation test using 2000 random permutations at a significance level of P<0.001 was used to
classify predefined clinical groups based on their gene expression profiles. In each
cross-validation step one sample is omitted and a multivariate CCP is created based on the genes
that are univariately significant at the specified level in the training set consisting of the
samples not omitted. This CCP is used to classify the omitted sample and it is then noted
whether the classification is correct or incorrect. This is repeated with all samples excluded
one at a time. The total cross-validated misclassification rate is thereby determined. The
statistical significance of the cross-validated misclassification rate is determined by repeating
the entire cross-validation procedure on data with the class membership labels randomly
permuted 2000 times. The CCP is based on a weighted linear combination of gene
expression variables that are univariately significant in the training set, with the weights being
the corresponding t-statistics as described in Radmacher et al., supra. When the CCP was
used to classify paired primary and metastatic tissue, the cross-validation was performed with
one pair at a time omitted and the classification based on the paired differences in expression
for each gene. Averaged gene expression data from duplicated samples were included in the
analysis.
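
The leave-one-out procedure described above can be sketched in Python as follows. This is a simplified illustration, not the BRB-ArrayTools implementation: gene selection here uses a hypothetical cutoff on the absolute pooled-variance t-statistic as a stand-in for the univariate significance level used in the text, the permutation test is omitted, and the toy expression values and all function names are assumptions made for the example.

```python
import math
from statistics import mean, variance


def t_statistic(a, b):
    """Pooled-variance two-sample t-statistic (mean of a minus mean of b)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    if pooled == 0:
        return 0.0  # no variation in either class: treat the gene as uninformative
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def train_ccp(samples, labels, t_cutoff):
    """Select genes with |t| >= t_cutoff; return {gene index: (t, midpoint)}."""
    model = {}
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        t = t_statistic(a, b)
        if abs(t) >= t_cutoff:
            model[g] = (t, (mean(a) + mean(b)) / 2)
    return model


def classify(model, sample):
    """Assign class 0 when the compound covariate is positive, otherwise class 1."""
    score = sum(t * (sample[g] - mid) for g, (t, mid) in model.items())
    return 0 if score > 0 else 1


def loocv_errors(samples, labels, t_cutoff):
    """Count misclassifications, re-selecting genes from scratch in every fold."""
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train_ccp(train_x, train_y, t_cutoff)
        if classify(model, samples[i]) != labels[i]:
            errors += 1
    return errors


# Toy log-ratios: gene 0 separates the classes, gene 1 is noise (hypothetical data).
samples = [[2.0, 0.1], [2.2, -0.2], [1.8, 0.3], [2.1, 0.0],
           [-2.0, 0.2], [-1.9, -0.1], [-2.2, 0.0], [-2.1, 0.1]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(loocv_errors(samples, labels, t_cutoff=3.0))
```

Repeating the gene selection inside each fold is the point of the design: selecting genes once on all samples and then cross-validating only the classifier would let the omitted sample influence its own prediction.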

[0228] To generate a prediction model to classify HCC with metastatic potential, we
randomly selected 10 PN samples and 10 PT samples as a training set. A total of 20 blinded
new HCC samples were included as a testing set. The classification of new samples was
based on the following linear combination: $L = \sum_i t_i (x_i - m_i)$, where $t_i$
= t-value for gene i in the classifier, $x_i$ = log-ratio of gene i in the new sample to be classified,
and $m_i$ = midpoint between PN and PT groups for gene i (see Table 2). Additional details are
available in the BRB-ArrayTools User's Guide. Kaplan-Meier survival analysis was used to
compare patient survival, using the Excel-based WinSTAT software. The statistical P value
was generated by the Cox-Mantel log-rank test when PN was compared to P or PT.
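
As a worked illustration of the linear combination above, the following Python sketch computes L for a new sample using only three of the 153 classifier genes. The t-values and midpoints are copied from the first three rows of Table 2; the log-ratios of the new sample are hypothetical, and a real prediction would sum over all genes in the classifier. The sign convention (negative L for metastatic, positive L for non-metastatic) follows the description of Fig. 2 in the Results.

```python
# (t-value, midpoint) pairs from the first three rows of Table 2.
CLASSIFIER = {
    "LIMK1": (-7.7122, -0.433),  # LIM domain kinase 1
    "CENPE": (-7.2301, 0.217),   # centromere protein E
    "FZD2": (-7.0334, -0.499),   # frizzled (Drosophila) homolog 2
}


def l_value(log_ratios):
    """Compute L as the sum over genes of t_i * (x_i - m_i)."""
    return sum(t * (log_ratios[gene] - m) for gene, (t, m) in CLASSIFIER.items())


def predicted_group(l):
    """Negative L is read as metastatic, positive L as non-metastatic."""
    return "metastatic" if l < 0 else "non-metastatic"


# Hypothetical log-ratios for a new sample (not data from the study).
new_sample = {"LIMK1": 0.10, "CENPE": 0.50, "FZD2": -0.20}

score = l_value(new_sample)
print(round(score, 4), predicted_group(score))
```

Because all three t-values are negative, a sample whose log-ratios lie above the midpoints is pushed toward a negative L, and one whose log-ratios lie below them is pushed toward a positive L.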

d) Semi-quantitative RT-PCR and Western blotting.
[0229] Total RNA was reverse-transcribed with SUPERSCRIPT™ II RNase H- Reverse
Transcriptase and random hexamers (Invitrogen Inc.). PCR was done with 26 cycles (94°C,
30 sec; 53°C, 30 sec; 72°C, 1 min) followed by an extra cycle at 72°C for 10 min using
HotStarTaq Master Mix (QIAGEN) and the following primers: OPN sense
5'-GACTCGAACGACTCTGATGATGTA-3' (SEQ ID NO:3) and OPN antisense
5'-CTGGGCAACGGGGATGG-3' (SEQ ID NO:4).
QuantumRNA™ 18S (Ambion) was used as an internal standard.
Densitometry was used to quantify the amount of OPN, which was normalized by the 18S
product. Western blot analysis was done essentially as described by Wu et al., supra. Briefly,
protein lysates from CCL13, SK-Hep-1 and Hep3B cells were prepared in RIPA buffer (50
mM Tris-HCl, pH 7.4/150 mM NaCl/1% Triton X-100/1% deoxycholate/1.0% SDS/1%
aprotinin), separated on 10% SDS-PAGE, transferred to an Immobilon-P membrane
(Millipore, Bedford, MA), probed with a rat monoclonal anti-OPN antibody (Chemicon
International), and visualized by the ECL-based assay (Amersham).
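
The normalization of the OPN signal by the 18S product described above amounts to a simple ratio, sketched here in Python. The band intensities are hypothetical densitometry units, and the function and lane names are illustrative only.

```python
def normalized_opn(opn_band: float, rrna_18s_band: float) -> float:
    """Return the OPN band intensity relative to the 18S internal standard."""
    if rrna_18s_band <= 0:
        raise ValueError("18S band intensity must be positive")
    return opn_band / rrna_18s_band


# Hypothetical (OPN, 18S) densitometry readings for two lanes.
lanes = {"tumor": (5400.0, 1800.0), "non_tumor": (1200.0, 2000.0)}

relative = {name: normalized_opn(*bands) for name, bands in lanes.items()}
fold_change = relative["tumor"] / relative["non_tumor"]
print(relative, fold_change)
```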

e) Cell lines and in vitro invasion assay.

[0230] Two human hepatoma-derived cell lines with different metastatic potential,
SK-Hep-1 and Hep3B, and one non-transformed liver cell line, CCL13 (Chang liver cells), were
used to determine the functional association of OPN with metastatic potential using the BD
BioCoat™ Matrigel™ Invasion Chamber (BD Biosciences) according to the manufacturer's
instructions. These cells were obtained from the American Type Culture Collection. Cells were
routinely maintained at 37°C in a humidified atmosphere of 5% CO2 in EMEM (GIBCO)
medium supplemented with 10% fetal bovine serum, 1x nonessential amino acids, 1x sodium
pyruvate, 2 mM glutamine and penicillin/streptomycin. For invasion analysis, cells were
plated in the upper chamber in serum-free EMEM and incubated in the absence or presence of
either recombinant murine OPN (2 µg/ml) (R&D Systems) or a well-documented neutralizing
antibody against OPN (3 µg/ml) (R&D Systems) for 20 hours. EMEM medium
containing 5% FBS was added to the bottom chamber, serving as a chemoattractant. The
number of cells invading through the Matrigel™ membrane was calculated before and after
adding OPN or the antibody against OPN for each cell line.

f) Tissue histology analysis.

[0231] Paraffin-embedded tissue blocks were prepared and cut into serial sections
with a thickness of 5 µm, which were mounted on electrically charged glass slides. Slides were subjected
to hematoxylin and eosin (H&E) staining. Two pathologists read these slides independently
for the histological diagnosis. For immunohistochemistry analysis, slides were deparaffinized
and processed for immunostaining as described by Forgues et al., J. Biol. Chem. 276:22797-22803,
2001. Briefly, slides were incubated in a microwave oven for 15 min in 1x citrate
buffer for antigen retrieval and then quenched with 3% hydrogen peroxide for 10 min to block the
endogenous peroxidase activity. Following incubation with 10% donkey serum to
block non-specific binding, the sections were incubated overnight at 4°C with a rat
monoclonal anti-OPN antibody (Chemicon International). Biotinylated secondary antibodies
and streptavidin peroxidase complex (ABC Elite kit, Vector Labs) were used. Chromogenic
development was obtained by immersing the sections in 3,3'-diaminobenzidine (DAB)
solution (0.25 mg per ml with 3% hydrogen peroxide). The slides were counter-stained with
Harris' hematoxylin, dehydrated with alcohol to xylene, and mounted with Permount
(Sigma).

2. RESULTS 

a) Metastatic lesions are indistinguishable from their 
corresponding primary HCC. 
[0232] To define the specific changes associated with the metastatic process in HCC, we 
compared the gene expression profiles of primary HCC samples from individuals with either 
intra-hepatic spreads (group P) or tumor thrombi in the portal vein (group PT) together with 
their matched metastatic lesions, i.e., P-M or PT-M, respectively, with their corresponding 
non-cancerous liver tissues. Initially, we compared the gene expression profiles of 50 
primary and metastatic tumor samples from 30 randomly selected individuals [i.e., 10 
patients with metastasis-free HCC (group PN), 10 PT patients and 10 P patients]. We 
attempted to classify them into clinical groups with an unsupervised hierarchical clustering 
algorithm based on an overall expression similarity profile using either the entire set of 9180 genes or
approximately 2487 genes derived from a gene screen filter that excluded genes not
significantly more variable than the median at P<0.01. However, these clustering approaches
did not yield any meaningful classification that corresponded to predefined clinical groups. 
Similarly, we could not obtain a meaningful classification using 107 genes from filtering 
genes with an average of 2-fold greater variations in the gene expression ratio when 
compared with their median. The results of this analysis imply that primary and metastatic 
HCC differ only by a relatively small subset of genes, whereas the gene clustering algorithm
may be dominated by variations among many other genes, thereby hindering classification.

[0233] To search for such small differences, we applied a supervised class comparison
analysis with univariate F-tests and a global permutation test to define genes that were
differentially expressed among predefined clinical groups. A comparison of five clinical
groups (i.e., P, P-M, PT, PT-M, and PN) yielded a total of 143 significant genes (P<0.0005).
Multidimensional scaling analysis based on the first three principal components of these 143
significant genes revealed that the PN samples are distinct from the remaining samples, while
the P, P-M, PT, and PT-M samples are inseparable (Fig. 1a). Unexpectedly, the gene
expression profiles of primary and matched metastatic HCC tumors were not significantly
distinguishable.

b) PN is distinct from PT and P.

[0234] To confirm and extend the above findings, we performed a class comparison
analysis of 30 primary HCC samples from PN, PT, and P patients. This analysis yielded a
total of 383 significant genes (P<0.0005). A hierarchical clustering algorithm was then used
to sort these 30 PN, P, and PT samples based on the expression profile of these 383 genes
(Fig. 1b). Two major branches were observed in the hierarchical tree, one associated with PN
samples, and the other with P and PT samples. Again, P and PT samples were not fully
discriminated (Fig. 1b). Thus, primary metastasis-free HCC has a gene expression profile
markedly different from that of primary HCC with metastatic lesions in the portal vein or
elsewhere in liver parenchyma.

[0235] To further define a gene set that could accurately discriminate between two predefined
classes and to identify metastasis-associated genes, we used a supervised machine learning
classification algorithm known as the compound covariate predictor (CCP), which includes a
"leave-one-out" cross-validation test to avoid the statistical problem of over-estimating
prediction accuracy that occurs when a model is trained and evaluated with the same samples.
This analysis also creates a multivariate predictor for determining which one of the two
classes a given sample belongs to, and a gene list that is univariately significant at a given
significance level. We divided 50 HCC samples from 30 patients into various
pairs based on different clinical criteria and applied the CCP to each pair (Table 1), using the
entire gene set with a P value < 0.001. At this specified significance level, the expected
number of false-positive genes in the classifier is less than 10. The misclassification rate was
determined by leave-one-out cross-validation. For each step of the cross-validation in which
one sample was left out, the selection of informative genes and the creation of the multi-gene
classifier were repeated from scratch. The probability of obtaining as small a cross-validated
misclassification rate by chance was obtained by repeating the entire cross-validation
procedure using 2000 random permutations of the class labels for the clinical criteria being
evaluated. That gave rise to a classifier P value (Table 1). Using this supervised machine learning
classification algorithm, we again found no significant difference between paired PT and PT-M
samples (Table 1). Gene expression profiles in P and PT samples were almost identical to
their paired metastatic P-M and PT-M samples (Table 1). The number of genes in these
classifiers was at the background (false-positive) level. These data are in agreement with the
clustering and multidimensional scaling analysis described above.
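
The background (false-positive) level referred to above follows from simple arithmetic, sketched here in Python: testing 9180 clones at a per-gene threshold of 0.001 gives an expected count of about 9 genes passing by chance. The second function shows one common way to turn permuted misclassification rates into a classifier P value; it is an assumption about the general approach, not a description of the BRB-ArrayTools internals, and all names and the example rates are illustrative.

```python
def expected_false_positives(n_genes: int, alpha: float) -> float:
    """Expected number of genes passing a per-gene threshold by chance alone."""
    return n_genes * alpha


def permutation_p_value(observed_rate, permuted_rates):
    """Fraction of label permutations with an error rate at least as small as observed."""
    hits = sum(1 for rate in permuted_rates if rate <= observed_rate)
    return hits / len(permuted_rates)


background = expected_false_positives(9180, 0.001)
# With 2000 permutations, the smallest non-zero P value that can be resolved is 1/2000.
resolution = 1 / 2000

print(background, resolution)
print(permutation_p_value(0.0, [0.5, 0.4, 0.0, 0.6]))  # hypothetical permuted rates
```

A classifier whose gene list is not much longer than this background count, such as those for the paired primary and metastatic comparisons, therefore carries little evidence of a real difference.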

[0236] In contrast, we accurately predicted primary tumors (100%) from PN and PT
samples with a total of 153 significant genes in the classifier (Table 2). The cross-validated
misclassification rates were significantly lower than expected by chance (P<0.0005) (Table
1). Similarly, we accurately predicted PN and P samples as well as PN and P/PT samples
with significant numbers of genes in the classifiers (Table 1). However, the CCP yielded no
statistically significant classification among P, PT, PT-M, and P-M, and the number of genes in
these classifiers also was insignificant. Moreover, we found no statistically significant
classification when tumor sizes, ages, tumor encapsulation, or cirrhosis were used as clinical 
categories. These data are consistent with the findings of class comparison analysis including 
multidimensional scaling and hierarchical clustering algorithm analyses. We conclude that 
primary and metastatic tumors have a very similar gene expression signature and that primary 
metastasis-free HCC tumors are distinct from primary HCC tumors with either tumor 
thrombus in portal vein or intra-hepatic spread. 

Table 1. Performance of classifier during "leave-one-out" cross-validation*

Classifier            Clinical         Total     Number of      Classifier  Number of genes
category**            groups           number    cases          P value     in the
                                       of cases  misclassified              classifiers

PN vs. PT             PN               10        0              <0.0005     153
                      PT               10        0
PN vs. P              PN               10        1              <0.0005     157
                      P                10        0
PN vs. P/PT           PN               10        2              <0.001      256
                      P and PT         20        0
P vs. PT              P                10        3              0.216       20
                      PT               10        4
PT vs. PT-M           paired samples   10        3              0.296       1
P/PT vs. P-M/PT-M     paired samples   20        5              0.132       7
P vs. PT-M            P                10        4              0.248       14
                      PT-M             10        3
PT vs. P-M            PT               10        2              0.163       9
                      P-M              10        4
Tumor sizes           > 5 cm           16        7              0.234       7
                      < 5 cm           14        4
Ages                  >45 yr.          17        5              0.334       4
                      <45 yr.          13        7
Tumor encapsulated    presence         9         2              0.037       13
                      absence          21        4
Cirrhosis             presence         14        7              0.798       1
                      absence          6         6

* Compound covariate predictor was used to classify various clinical groups with a total of
9180 gene expression data at a significance level of P=0.001. The classifier was based on
2000 random permutations. The expected number of false-positive genes in the classifier is
10.

** PN, single primary HCC; PT, primary HCC with tumor thrombi in portal vein; PT-M,
tumor thrombi from paired PT; P, primary HCC with intra-hepatic metastasis; P-M,
intra-hepatic metastasis from paired P; P/PT, both P and PT; P-M/PT-M, both P-M and PT-M;
tumor sizes, diameter in length.

c) A gene expression-based model from a supervised machine
learning algorithm can predict HCC patients with
metastatic potential.

[0237] The success in distinguishing PN from PT with CCP allowed us to develop a
gene-expression-based model to predict HCC patients who had the potential to develop metastasis.
We randomly selected primary HCC samples from 10 PN patients and 10 PT patients as a
training set to generate a prediction model by "leave-one-out" cross-validated classification.
The classification of training samples created a 153-gene list, which provided the basis for
predicting testing samples, referred to as the "weighted voting" exercise, by generating a
multi-factorial L value (see Materials and Methods). We included all of the remaining 20
primary HCC samples as a test set (15 P patients, 3 additional PN patients, and 2 additional
PT patients). Fig. 2 shows the calculated "weighted voting" L value, with metastatic samples
yielding negative values and non-metastatic samples yielding positive values. All of the test
samples with the exception of one "P" sample (S29) were classified to the metastatic group
(Fig. 2a). Patient follow-up data indicated that one PN patient (S56) was found to develop
lung metastases 8 months following surgery, the second PN patient (S57) was cancer-free 9
months after surgery, and the third patient (S55) did not respond to the follow-up request.
We also analyzed these samples by multidimensional scaling based on the 153-gene set
obtained from the PN/PT comparison. It appears that S29 has a gene expression profile more
similar to the P and PT groups than to that of the PN group (Fig. 2b), suggesting that S29
should belong to the P and PT groups. Thus, we accurately classified at least 18 of 20
blinded HCC patients (90%) with metastatic potential.



Table 2. 153 significant genes for predicting metastasis and their values necessary for computing the multi-factorial L value in the prediction model.

UG cluster | Symbol | Description | t-value | Midpoint | p-value | Unique id
Hs.36566 | LIMK1 | LIM domain kinase 1 | -7.7122 | -0.433 | 0.000000 | 160082
Hs.75573 | CENPE | centromere protein E (312kD) | -7.2301 | 0.217 | 0.000001 | 160128
Hs.81217 | FZD2 | frizzled (Drosophila) homolog 2 | -7.0334 | -0.499 | 0.000002 | 160028
Hs.146580 | ENO2 | enolase 2, (gamma, neuronal) | -6.9978 | -0.238 | 0.000002 | 160068
Hs.222 | ITGA9 | integrin, alpha 9 | -6.699 | -0.159 | 0.000004 | 160135
Hs.75887 | COPA | coatomer protein complex, subunit alpha | -6.4035 | -0.241 | 0.000007 | 159890
Hs.6727 | KIAA0660 | Ras-GTPase activating protein SH3 domain | -6.3742 | -0.281 | 0.000007 | 160103
Hs.89578 | GTF2H1 | general transcription factor IIH, polypeptide 1 | -6.2909 | -0.178 | 0.000006 | 164987
Hs.180941 | VPS41 | vacuolar protein sorting 41 (yeast homolog) | -5.9459 | -0.331 | 0.000013 | 159888
Hs.99236 | RGS20 | regulator of G-protein signaling 20 | -5.8503 | -0.264 | 0.000015 | 161959
Hs.274 t | MATK | megakaryocyte-associated tyrosine kinase | -5.8166 | -0.366 | 0.000016 | 160015
Hs.194816 | STOML1 | stomatin (EBP72)-like 1 | -5.7855 | -0.124 | 0.000018 | 162695
Hs.79516 | BASP1 | membrane attached signal protein 1 | -5.5974 | -0.415 | 0.000026 | 159882
Hs.733 | EPB42 | erythrocyte membrane protein band 4.2 | -5.5395 | -0.378 | 0.000029 | 160067
Hs.87539 | ALDH3B2 | aldehyde dehydrogenase 3 family, member B2 | -5.5356 | -0.351 | 0.000030 | 166071
Hs.5947 | MEL | mel transforming oncogene | -5.434 | -0.452 | 0.000045 | 160104
Hs.118354 | CAT56 | CAT56 protein | -5.4077 | -0.316 | 0.000047 | 165027
Hs.27744 | RAB3A | RAB3A, member RAS oncogene family | -5.35 | -0.338 | 0.000044 | 160099
Hs.7984 | PSCD3 | pleckstrin homology | -5.3177 | -0.143 | 0.000047 | 159887
Hs.104519 | PLD2 | phospholipase D2 | -5.2672 | -0.275 | 0.000052 | 159999
Hs.4748 | ADCYAP1R1 | adenylate cyclase activating polypeptide 1 | -5.2037 | -0.166 | 0.000060 | 161460
Hs.83155 | ALDH3B1 | aldehyde dehydrogenase 3 family, member B1 | -5.2005 | -0.44 | 0.000088 | 159838
Hs.283822 | RHD | Rhesus blood group, D antigen | -5.1898 | -0.369 | 0.000062 | 164821
Hs.2175 | CSF3R | colony stimulating factor 3 receptor | -5.1684 | -0.136 | 0.000065 | 160114
Hs.3094 | KIAA0063 | KIAA0063 gene product | -5.162 | -0.325 | 0.000095 | 160091
Hs.119273 | KIAA0296 | KIAA0296 gene product | -5.132 | -0.545 | 0.000070 | 159951
Hs.23672 | LRP6 | low density lipoprotein receptor-related protein 6 | -5.1081 | -1.13 | 0.000074 | 162040
Hs.118804 | ENO3 | enolase 3, (beta, muscle) | -5.0415 | -0.76 | 0.000085 | 164468
Hs.74502 | CTRB1 | chymotrypsinogen B1 | -5.0381 | -0.216 | 0.000086 | 159787
Hs.194148 | YES1 | v-yes-1 Yamaguchi sarcoma viral oncogene | -5.0064 | -0.413 | 0.000092 | 159875
- | - | Unknown (IncytePD:1404153) | -4.9541 | -0.155 | 0.000103 | 160122
Hs.772 | GYS1 | glycogen synthase 1 (muscle) | -4.913 | -0.478 | 0.000112 | 160222
Hs.153203 | MDFI | MyoD family inhibitor | -4.8908 | -0.773 | 0.000138 | 163880
Hs.247423 | ADD2 | adducin 2 (beta) | -4.8064 | -0.609 | 0.000141 | 162687
Hs.22785 | GABRE | gamma-aminobutyric acid (GABA) A receptor | -4.8046 | -0.188 | 0.000142 | 159794
- | - | Unknown (IncytePD:2685601) | -4.7898 | -0.307 | 0.000147 | 165108
Hs.97087 | CD3Z | CD3Z antigen, zeta polypeptide (TiT3 complex) | -4.7723 | -0.487 | 0.000152 | 160043
Hs.79006 | DTYMK | deoxythymidylate kinase (thymidylate kinase) | -4.7693 | 0.254 | 0.000153 | 161858
Hs.26915 | SPTBN2 | spectrin, beta, non-erythrocytic 2 | -4.7666 | -0.364 | 0.000154 | 160846
- | - | Unknown (IncytePD:2509789) | -4.7523 | -0.175 | 0.000159 | 164920
Hs.38586 | HSD3B1 | hydroxy-delta-5-steroid dehydrogenase | -4.7519 | -0.392 | 0.000159 | 164787
Hs.32966 | GUCA2B | guanylate cyclase activator 2B (uroguanylin) | -4.7519 | -0.368 | 0.000159 | 164851
Hs.12773 | ACOX3 | acyl-Coenzyme A oxidase 3, pristanoyl | -4.7455 | -0.25 | 0.000187 | 162487
Hs.2281 | CHGB | chromogranin B (secretogranin 1) | -4.7199 | -0.269 | 0.000171 | 160078
Hs.25197 | STUB1 | STIP1 homology and U-Box containing protein 1 | -4.6897 | -0.264 | 0.000183 | 160555
Hs.169536 | RHAG | Rhesus blood group-associated glycoprotein | -4.6648 | -0.326 | 0.000193 | 164916
Hs.96 | PMAIP1 | PMA-induced protein 1 | -4.6573 | -0.124 | 0.000196 | 160112
Hs.153053 | CD37 | CD37 antigen | -4.6051 | -0.652 | 0.000220 | 160033
Hs.155227 | EPHB4 | EphB4 | -4.5965 | -0.276 | 0.000257 | 168938
Hs.92282 | PITX2 | paired-like homeodomain transcription factor 2 | -4.584 | -0.149 | 0.000230 | 160123
Hs.79123 | KIAA0084 | KIAA0084 protein | -4.583 | -0.296 | 0.000231 | 159886
Hs.180878 | LPL | lipoprotein lipase | -4.5304 | -0.18 | 0.000259 | 160485
Hs.75658 | PYGB | phosphorylase, glycogen; brain | -4.5152 | 0.027 | 0.000268 | 159778
Hs.286132 | MN7 | D15F37 (pseudogene) | -4.503 | -0.314 | 0.000275 | 167399
Hs.57600 | AP1S1 | adaptor-related protein complex 1 | -4.4656 | -0.26 | 0.000299 | 160042
Hs.67688 | - | ESTs | -4.4472 | -0.458 | 0.000311 | 162920
Hs.172458 | IDS | iduronate 2-sulfatase (Hunter syndrome) | -4.4324 | -0.259 | 0.000322 | 160243
Hs.80768 | CLCN7 | chloride channel 7 | -4.4298 | 0.058 | 0.000324 | 161279
Hs.347527 | SLC20A2 | solute carrier family 20, member 2 | -4.4173 | -0.308 | 0.000333 | 159936
Hs.72550 | HMMR | hyaluronan-mediated motility receptor (RHAMM) | -4.3918 | -0.443 | 0.000352 | 167575
- | - | Unknown (IncytePD:1681876) | -4.3868 | -0.275 | 0.000356 | 166536
Hs.242947 | DGKI | diacylglycerol kinase, iota | -4.3835 | -0.369 | 0.000358 | 161826
Hs.158249 | KIAA0406 | KIAA0406 gene product | -4.3376 | -0.066 | 0.000397 | 159825
Hs.182577 | INPP5B | inositol polyphosphate-5-phosphatase, 75kD | -4.315 | -0.269 | 0.000417 | 160074
Hs.37054 | EFNA3 | ephrin-A3 | -4.3085 | -0.355 | 0.000423 | 161846
Hs.334841 | SELENBP1 | selenium binding protein 1 | -4.3016 | -0.481 | 0.000430 | 169315
Hs.81454 | KHK | ketohexokinase (fructokinase) | -4.2966 | -0.36 | 0.000434 | 159931
Hs.84790 | KIAA0225 | KIAA0225 protein | -4.2732 | -0.151 | 0.000582 | 160472
Hs.94498 | LILRA2 | leukocyte immunoglobulin-like receptor | -4.2714 | -0.308 | 0.000459 | 161424
Hs.151393 | GCLC | glutamate-cysteine ligase, catalytic subunit | -4.2523 | -0.421 | 0.000479 | 166059
Hs.151738 | MMP9 | matrix metalloproteinase 9 | -4.2337 | -0.473 | 0.000722 | 159912
Hs.69707 | HCGII-7 | HCGII-7 protein | -4.2223 | 0.802 | 0.000512 | 161462
Hs.152251 | FZD5 | frizzled (Drosophila) homolog 5 | -4.2088 | -0.386 | 0.000528 | 164899
- | - | Unknown (IncytePD:1570216) | -4.2019 | -0.336 | 0.000536 | 159962
Hs.61712 | PDK1 | pyruvate dehydrogenase kinase, isoenzyme 1 | -4.1746 | -0.251 | 0.000570 | 160462
Hs.66731 | HOXB13 | homeo box B13 | -4.1722 | -0.739 | 0.000573 | 159868
Hs.80976 | MKI67 | antigen identified by monoclonal antibody Ki-67 | -4.1699 | -0.148 | 0.000642 | 160039
Hs.283664 | ASPH | aspartate beta-hydroxylase | -4.1693 | 0.062 | 0.000576 | 160084
Hs.76688 | CES1 | carboxylesterase 1 | -4.1577 | -1.285 | 0.000591 | 164490
Hs.154230 | NDP52 | nuclear domain 10 protein | -4.1483 | -0.178 | 0.000604 | 159958
Hs.75596 | IL2RB | interleukin 2 receptor, beta | -4.1376 | -0.268 | 0.000688 | 159942
Hs.4756 | FEN1 | flap structure-specific endonuclease 1 | -4.1222 | 0.195 | 0.000640 | 160035
Hs.673 | IL12A | interleukin 12A | -4.0844 | -0.082 | 0.000696 | 162579
Hs.89230 | KCNN3 | potassium calcium-activated channel | -4.0745 | 0.008 | 0.000711 | 161095
Hs.799 | DTR | diphtheria toxin receptor | -4.0616 | -0.421 | 0.000812 | 167412
Hs.120360 | PLA2G6 | phospholipase A2, group VI | -4.0344 | -0.577 | 0.000778 | 160058
Hs.171075 | RFC5 | replication factor C (activator 1) 5 (36.5kD) | -4.0263 | 0.114 | 0.000792 | 161332
Hs.99899 | TNFSF7 | tumor necrosis factor superfamily, member 7 | -4.0211 | -0.221 | 0.000801 | 159817
Hs.9605 | CPSF5 | cleavage and polyadenylation specific factor 5 | -4.0101 | 0.079 | [illegible] | [illegible]
Hs.95262 | NFRKB | nuclear factor related to kappa B binding protein | -4.0081 | -0.162 | [illegible] | [illegible]
Hs.37129 | SCNN1B | sodium channel, nonvoltage-gated 1 | -4.0053 | -0.244 | [illegible] | [illegible]
Hs.296371 | RAB28 | RAB28, member RAS oncogene family | -4.0038 | 0.343 | [illegible] | [illegible]
Hs.83795 | IRF2 | interferon regulatory factor 2 | -3.9955 | -0.527 | [illegible] | [illegible]
Hs.85087 | LTBP4 | latent TGF-beta binding protein 4 | -3.9927 | [illegible] | [illegible] | 159923
Hs.267448 | CGI-85 | CGI-85 protein | -3.986 | [illegible] | [illegible] | 166502
Hs.121521 | ABL2 | v-abl murine leukemia viral oncogene homolog 2 | -3.9746 | -0.347 | [illegible] | [illegible]
Hs.28166 | CRSP8 | cofactor for Sp1 transcriptional activation | -3.9714 | 0.07 | [illegible] | [illegible]
Hs.239706 | GAB1 | GRB2-associated binding protein 1 | -3.9529 | -0.347 | [illegible] | [illegible]
Hs.177687 | AKR1C4 | aldo-keto reductase family 1, member C4 | -3.9499 | 0.145 | [illegible] | [illegible]
Hs.25648 | TNFRSF5 | TNF receptor superfamily, member 5 | -3.9371 | -0.147 | [illegible] | [illegible]
Hs.858 | RELB | v-rel viral oncogene homolog B | -3.935 | -0.12 | [illegible] | [illegible]
Hs.155314 | KIAA0095 | KIAA0095 gene product | -3.9244 | [illegible] | [illegible] | [illegible]
Hs.8358 | FLJ20366 | hypothetical protein FLJ20366 | 3.9437 | 0.201 | [illegible] | [illegible]
Hs.112819 | - | ESTs | 3.9573 | 0.217 | [illegible] | [illegible]
Hs.126263 | - | ESTs, Highly similar to A38712 fibrillarin | 3.9651 | [illegible] | [illegible] | [illegible]
Hs.10669 | DDEF1 | development and differentiation enhancing factor 1 | 3.9709 | [illegible] | [illegible] | [illegible]
Hs.99216 | - | ESTs, similar to ALU8 | 3.9802 | [illegible] | [illegible] | 169148
Hs.98738 | GRTH | gonadotropin-regulated testicular RNA helicase | 3.9911 | [illegible] | [illegible] | 166657
Hs.28274 | - | Homo sapiens cDNA: FLJ22049 fis | 3.9912 | [illegible] | [illegible] | [illegible]
Hs.186564 | - | ESTs | 4.0128 | 0.177 | [illegible] | [illegible]
Hs.34045 | FLJ20764 | hypothetical protein FLJ20764 | 4.0142 | 0.325 | [illegible] | [illegible]
Hs.3686 | KIAA0978 | KIAA0978 protein | 4.0211 | 0.308 | [illegible] | [illegible]
Hs.172148 | - | ESTs | 4.0307 | 0.179 | [illegible] | [illegible]
Hs.239499 | KIAA0185 | KIAA0185 protein | 4.0679 | 0.17 | [illegible] | [illegible]
Hs.169341 | HTPAP | HTPAP protein | 4.1104 | [illegible] | [illegible] | 163274
Hs.44131 | KIAA0974 | KIAA0974 protein | 4.1179 | [illegible] | [illegible] | 164589
Hs.2969 | SKI | v-ski avian sarcoma viral oncogene homolog | 4.1484 | 0.323 | [illegible] | 164039
Hs.80618 | FLJ20015 | hypothetical protein | 4.1716 | [illegible] | [illegible] | [illegible]
Hs.136309 | SH3GLB1 | SH3-domain, GRB2-like, endophilin B1 | 4.1832 | [illegible] | [illegible] | 162621
Hs.274293 | - | Homo sapiens mRNA; cDNA DKFZp761G1111 | 4.1964 | -0.013 | [illegible] | [illegible]
Hs.21479 | UBN1 | ubinuclein 1 | 4.2096 | [illegible] | [illegible] | 167995
Hs.155160 | SRP46 | Splicing factor, arginine/serine-rich, 46kD | 4.2889 | 0.291 | [illegible] | [illegible]
Hs.105584 | RPS6KA4 | ribosomal protein S6 kinase, 90kD, polypeptide 4 | 4.3239 | 0.349 | [illegible] | [illegible]
Hs.279886 | RANBP9 | RAN binding protein 9 | 4.336 | [illegible] | [illegible] | 168730
Hs.197298 | NS1-BP | NS1-binding protein | 4.346 | [illegible] | [illegible] | [illegible]
- | - | Unknown (IncytePD:2895226) | 4.3857 | [illegible] | [illegible] | 161881
Hs.36793 | FLJ23188 | hypothetical protein FLJ23188 | [illegible] | [illegible] | 0.000353 | 168869
Hs.17384 | - | ESTs | [illegible] | [illegible] | 0.000347 | 163225
Hs.78524 | HTCD37 | TcD37 homolog | [illegible] | [illegible] | 0.000338 | 167570
Hs.2301 | DBH | dopamine beta-hydroxylase | 4.4196 | [illegible] | [illegible] | 168202
Hs.118795 | FLJ10008 | hypothetical protein FLJ10008 | 4.4386 | [illegible] | [illegible] | 166653
Hs.33074 | - | Homo sapiens, clone IMAGE:3606519 | 4.5036 | [illegible] | [illegible] | [illegible]
Hs.4988 | - | Homo sapiens clone 24711 mRNA sequence | 4.5042 | 0.016 | [illegible] | 160165
Hs.288872 | FLJ21439 | hypothetical protein FLJ21439 | 4.5242 | [illegible] | [illegible] | 168393
Hs.323712 | KIAA0615 | KIAA0615 gene product | 4.5292 | 0.024 | [illegible] | [illegible]
Hs.14051 | - | Homo sapiens mRNA; cDNA DKFZp434A2417 | 4.5538 | 0.215 | 0.000246 | [illegible]
Hs.296287 | - | Similar to bromodomain-containing 4 | 4.5576 | 0.499 | 0.000244 | [illegible]
Hs.57847 | - | ESTs, similar to CASPASE-4 PRECURSOR | 4.63 | 0.264 | 0.000208 | 165194
Hs.26289 | - | ESTs | 4.7062 | 0.948 | 0.000176 | [illegible]
Hs.11123 | DKFZP564G092 | DKFZP564G092 protein | 4.9593 | 0.476 | [illegible] | [illegible]
Hs.288908 | - | cDNA: FLJ21913 fis, clone HEP03888 | 4.9597 | [illegible] | [illegible] | 168395
Hs.77495 | UBXD2 | UBX domain-containing 2 | 4.9758 | [illegible] | [illegible] | 160190
Hs.24341 | TAZ | transcriptional co-activator with PDZ-binding motif | 5.0014 | 0.127 | 0.000093 | 164176
Hs.50133 | - | ESTs | 5.153 | 0.243 | 0.000067 | 168567
Hs.262958 | DKFZP434B044 | hypothetical protein DKFZp434B044 | 5.1851 | 0.378 | 0.000075 | 169042
Hs.53478 | - | Homo sapiens cDNA FLJ12366 fis | 5.2202 | 0.111 | 0.000058 | 168383
Hs.80658 | UCP2 | uncoupling protein 2 | 5.2483 | [illegible] | 0.000054 | 168158
Hs.209065 | FLJ14225 | hypothetical protein FLJ14225 | 5.3394 | 0.468 | 0.000045 | 164339
Hs.92357 | GALK1 | galactokinase 1 | 5.6456 | 1.15 | 0.000037 | 169675
Hs.50373 | - | ESTs | 5.7625 | 0.94 | 0.000029 | 165500
Hs.266959 | HBG1 | hemoglobin, gamma A | 5.9704 | 1.164 | 0.000026 | 168326
Hs.25566 | - | ESTs | 6.1164 | 0.182 | 0.000009 | 168197
Hs.25277 | FLJ21065 | hypothetical protein FLJ21065 | 6.1957 | 0.116 | 0.000008 | 164202



[0238] The above outcome predictor separated 40 patients into two groups, one being
metastatic and the other being non-metastatic. Kaplan-Meier survival data indicate that
patients who were predicted to be metastatic had significantly shortened survival when
compared with patients without detectable metastasis (Fig 2c). Because the mortality of HCC
patients relies largely on whether they develop intra-hepatic metastasis, our results indicate
that the gene set used in the classifier provides an accurate gene expression signature
reflecting liver cancer metastasis and survival.
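
For readers unfamiliar with the Kaplan-Meier method referred to above, the following is a minimal illustrative sketch of the product-limit estimate. It is not the analysis code used in this study; the function name and the example follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time for each patient
    events -- 1 if death was observed at that time, 0 if the patient was censored
    Returns a list of (time, survival probability) at each time with a death.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= removed
    return curve


# Hypothetical follow-up data (months); 0 marks a censored patient.
example_curve = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
```

Two such curves, one per predicted group, would then be compared with a log-rank test, which this sketch does not implement.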

d) Osteopontin promotes HCC metastasis. 

[0239] The above study indicates that the genes necessary for intra-hepatic metastasis
should be included in the prediction model. However, the list of 153 genes from the
prediction model was based on a stringent criterion (P value at 0.001) to minimize the
number of false-positive genes in the classifier, as needed for an accurate classification.
Such a stringent criterion may exclude many genes that could be significant for metastasis
progression. To broaden our search, we performed univariate F-tests with a total of 2000
random permutations at a P value of < 0.002 on 10 PN and 10 PT primary HCC samples.
This analysis yielded a total of 224 significant genes with less than 20 expected
false-positives (see Table 3). To identify genes that may contribute to liver cancer
metastasis, we inspected the 224-gene list and sorted out the top 30 genes whose expression
was altered largely in PT and PT-M, but rarely in PN (see Table 4). These genes were
median-centered and visualized by a hierarchical clustering algorithm using centered
correlation and complete linkage (Fig. 3a).

[0240] A gene with an average of over 3-fold overexpression in PT, but not in PN, was
identified as osteopontin (OPN) (SEQ ID NO:1), a secreted phosphoprotein that has recently
been found to be highly expressed in metastatic breast tumors as well as malignant lung,
colon, and prostate cancers. Comparison of microarray expression data indicated that OPN
expression is elevated in most PT samples and their corresponding PT-M samples, but to a
much lesser degree in the PN samples (Fig 3b). OPN overexpression in PT samples, but not
in PN samples, was confirmed by a semi-quantitative RT-PCR analysis (Fig 3c and d).
Immunohistochemical analysis (IHC) of OPN was also performed on 29 primary HCC
(including 16 new HCC cases) and 8 normal livers from healthy organ donors. The
immunoreactivity of OPN on these samples was evaluated in a blinded fashion. Only
metastatic tumors were positive for cytoplasmic OPN staining, especially in areas with a
high density of vasculature (Fig. 4). The IHC results mostly agreed with the microarray and
RT-PCR data (61% positive cases; 11 of 18 metastatic HCC) (data not shown). Taken together,
these studies demonstrate a good diagnostic value of OPN for metastatic HCC patients.

[0241] To determine the role of OPN in metastasis, we compared the level of OPN in
human HCC cell lines by Western blot with their in vitro invasiveness by Matrigel assay. The
level of OPN was high in SK-Hep-1, intermediate in Hep3B and low in CCL13 (Fig 5a), which
coincided with their invasiveness (Fig 5b). An OPN neutralizing antibody significantly
blocked invasion of SK-Hep-1 (p<0.001) and Hep3B cells (p<0.04). However, recombinant
murine OPN did not show any statistically significant stimulation (p>0.05) of Hep3B and
SK-Hep-1 cells, implying either that OPN produced by tumor cells is sufficient for
maintaining an invasive phenotype, or that the lesser effect is due to species difference.
Similar results were obtained with 5 additional HCC cell lines (Fig 5c). However, the
neutralizing antibody had little effect on cell viability and migration (Fig 5c, right panel).

[0242] To extend the above finding, we examined the role of OPN in pulmonary metastasis of
HCC cells in nude mice. The HCCLM3 cell line is a clone derived from MHCC97 cells with a
high degree of pulmonary metastasis following subcutaneous (s.c.) injection (Li et al., J.
Cancer Res. Clin. Oncology, 2002). Consistent with our recent data, 100% tumorigenicity
was achieved 1 week after s.c. injection. There was no significant difference in the size
of primary tumors between the control and anti-OPN groups (Figure 5E), which is consistent
with our in vitro results that anti-OPN does not affect HCC cell growth. At the 5th week,
pulmonary metastatic lesions were detected in every mouse in the control group, with mostly
grade I-II tumor clusters and some grade III-IV tumor clusters (Figure 5E, F). The control
mice had an average of 11.1 ± 2.9 tumor clusters per lung. In contrast, only about half of
the mice in the anti-OPN group had developed lung metastasis and the remaining mice
developed mostly grade I tumor clusters, with a combined average of 2.6 ± 1.0 tumor clusters
per lung, and this effect was statistically significant (p<0.01). Therefore, anti-OPN
antibody shows a significant inhibitory effect on the lung metastasis of HCCLM3 cells.
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Description 


hypothetical protein FLJ13213 


Unknown 


HTPAP protein 


TcD37 homolog 


hypothetical protein FLJ23188 


nuclear domain 1 0 protein 


tumor necrosis factor receptor superfamily, member 5 


Ras-GTPase activating protein SH3 domain-binding protein 2 


adenylate cyclase 3 


RAN binding protein 9 


RAD54 (S.cerevisiae)-like 


hypothetical protein from EUROIMAGE 1669387 


hypothetical protein FU 13119 


stomatin (EBP72)-like 1 


Unknown 


K1AA0225 protein 


antigen identified by monoclonal antibody Ki-67 


general transcription factor IIH, polypeptide 1 (62kD subunit) 


RAB3 A, member RAS oncogene family 


chromogranin B (secretogranin 1) 


paired-like homeodomain transcription factor 2 


mitogen-activated protein kinase kinase kinase 6 


pleckstrin homology, Sec7 and coiled/coil domains 3 


KIAA0406 gene product 


pyruvate dehydrogenase kinase, isoenzyme 1 


Name 


FLJ13213 
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B. Example 2: Predicting a predisposition for Hepatocellular Carcinoma 
1. Material and methods 

a) Patients and tissue samples 

[0243] Surgical specimens were collected with prior informed consent under protocols
approved by the Institutional Review Board of the University of Minnesota.
Liver samples were obtained from 59 end-stage chronic liver disease patients who received
liver transplantation between 1995 and 2001. Disease-free liver samples from 8 liver donors
were used as controls. The collection of these samples was mainly managed through the Liver
Tissue Procurement and Distribution System (LTPADS) at the University of Minnesota, USA.
Tumor and matched non-tumor liver samples from 64 patients were obtained through either
the LTPADS program or the Liver Cancer Institute at Fudan University, China. Frozen samples,
once received, were stored immediately at -80°C and logged in a tissue repository database.

b) cDNA microarray 

[0244] Total RNA was extracted from frozen tissues by using Trizol reagent (Invitrogen,
Gaithersburg, MD) according to the manufacturer's protocol. The quality of extracted RNA
was determined by spectrophotometry and by the appearance of the characteristic 28S and 18S
rRNA fragments on a 1% agarose gel. Each RNA sample was divided into several tubes in equal
amounts and stored at -80°C. For the common reference of the cDNA microarray, total RNA
samples from 8 normal livers were combined and aliquoted into tubes.

[0245] cDNA microarrays were purchased from the NCI microarray facility, Advanced
Technology Center, NCI, NIH (Gaithersburg, MD). These human UniGem v2.0 arrays
contained 9180 cDNA clones that map to 8281 unique UniGene clusters (based on Hs
UniGene Build #131 released on Feb. 28, 2001) and 122 Incyte EST clones (Incyte
Genomics, Palo Alto, CA). The hybridization was performed according to an optimized
protocol established by the NCI (Wu et al., Oncogene 20:3674-3682, 2001; Ye et al., Nature
Med. 9:416-423, 2003). Fluorescent images of hybridized microarrays were obtained by
using a GenePix 4000 scanner and GenePix Pro software (Axon Instruments, Foster City, CA).
Detailed information, collected according to the proposed Minimum Information
About a Microarray Experiment standards (Brazma A et al., Nat Genet 2001), will be made
available through the NCBI's Gene Expression Omnibus public database.
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c) Statistical analysis 
[0246] A hierarchical clustering analysis was performed using a relative gene expression
ratio (Cy5/Cy3) to examine the relatedness among expression patterns of several gene lists
and those in two risk groups. Cluster analysis was performed using Cluster software and
visualized using TreeView software (Eisen et al., supra). Hierarchical clustering was
performed following median centering normalization.
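
As an illustration only, the clustering steps named in this document (median centering, a centered-correlation similarity, and complete linkage, the latter two as stated for Example 1) can be sketched as below. This is not the Cluster software itself; the function names are hypothetical, and the sketch assumes every expression profile has non-zero variance.

```python
from statistics import median


def median_center(row):
    """Subtract the row median from every value in the row."""
    m = median(row)
    return [v - m for v in row]


def centered_correlation_distance(a, b):
    """1 minus the Pearson (centered) correlation of two profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / (var_a * var_b) ** 0.5


def complete_linkage(rows):
    """Agglomerative clustering; returns merges as (cluster, cluster, distance)."""
    clusters = [(i,) for i in range(len(rows))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(centered_correlation_distance(rows[p], rows[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical log-ratio profiles for three genes across four samples.
profiles = [median_center(r) for r in ([1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1])]
merge_order = complete_linkage(profiles)
```

The brute-force search over cluster pairs is adequate for a few hundred genes; dedicated clustering software uses faster bookkeeping.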

[0247] Analyses were performed using BRB ArrayTools developed by Dr. Richard Simon
and Amy Peng of the Biometrics Research Branch at the National Cancer Institute. The data
from each array were scaled in order to normalize data for inter-array comparisons. The class
comparison tool was used for comparing two pre-defined risk groups. The F-test is a
generalization of the two-sample t-test for comparing values among groups. The class
comparison tool computed an F-test separately for each gene using the normalized log-ratios
for cDNA. Several other important statistics were also computed. The tool performed
random permutations of the group labels. Based on these random permutations, the tool
computed the permutation p value associated with each gene in the list.
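
The per-gene permutation test described above can be sketched as follows. This is a simplified illustration, not the BRB ArrayTools implementation: the function names are hypothetical, and the "(hits + 1) / (permutations + 1)" convention for the p value is an assumption of this sketch.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of log-ratios."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    k, n = len(groups), len(values)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def permutation_p_value(groups, n_perm=2000, seed=0):
    """Fraction of label permutations whose F is at least the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    sizes = [len(g) for g in groups]
    pooled = [v for g in groups for v in g]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # equivalent to shuffling the class labels
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Small relative tolerance guards against floating-point round-off.
        if f_statistic(shuffled) >= observed * (1 - 1e-12):
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical log-ratios of one gene in two classes.
p = permutation_p_value([[0.1, 0.2, 0.15, 0.3, 0.25], [1.1, 1.3, 1.2, 1.4, 1.25]])
```

With two groups the F statistic equals the square of the pooled two-sample t statistic, which is why the same routine covers both the two-class and multi-class comparisons.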

[0248] Classification of samples into one of two pre-determined classes based on gene
expression data was performed using several algorithms including the compound covariate
predictor, the k-nearest neighbor predictor, or the support vector machine predictor. The
compound covariate predictor was built in two steps. First, a standard two-sample t-test was
performed to identify genes with significant differences (at level 0.001) in log-expression
ratios between the two classes. Second, the log-expression ratios of differentially expressed
genes were combined into a single compound covariate for each sample; the compound covariate
was used as the basis for class prediction. The compound covariate for sample i was defined as

$$c_i = \sum_j t_j \, x_{ij}$$

where $t_j$ was the t-statistic for the two-group comparison of classes with
respect to gene j, $x_{ij}$ was the log-ratio measured in specimen i for gene j, and the sum
is over all differentially expressed genes.
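
A minimal sketch of the two ingredients of the compound covariate is given below for illustration. It assumes a pooled-variance two-sample t statistic; the function names and example numbers are hypothetical and are not taken from the study data.

```python
from statistics import mean, variance


def two_sample_t(class_a, class_b):
    """Pooled-variance two-sample t statistic for one gene's log-ratios."""
    na, nb = len(class_a), len(class_b)
    pooled_var = ((na - 1) * variance(class_a) + (nb - 1) * variance(class_b)) / (na + nb - 2)
    return (mean(class_a) - mean(class_b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


def compound_covariate(sample_log_ratios, t_values):
    """Sum over selected genes of t_j times the sample's log-ratio for gene j."""
    return sum(t * x for t, x in zip(t_values, sample_log_ratios))


# Hypothetical example: one gene measured in two classes, then one sample with two genes.
t_gene = two_sample_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
c_sample = compound_covariate([0.5, -1.0], [2.0, -3.0])
```

Because each gene is weighted by its signed t statistic, genes that are higher in the first class push the covariate up and genes that are higher in the second class push it down.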

[0249] We predicted the classification of a new sample by computing the following linear
combination:

$$L = \sum_i t_i \, (x_i - m_i)$$

where $t_i$ was the t-value for gene i, $x_i$ was the log-ratio of gene i in the new sample
to be classified, and $m_i$ was the midpoint between the two classes for gene i. The index i
runs over all the genes that were significant in the original analysis. When L was positive,
the new sample was classified to be of the first phenotype label, whereas when L was
negative, the new sample was classified to be of the second phenotype label.
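
The L value can be computed directly from the t-value and midpoint columns of Table 2. The sketch below is illustrative only: it uses the Table 2 entries for LIMK1 and GALK1, the sample log-ratios are hypothetical, and returning "undetermined" when L is exactly zero is an assumption, since the text does not address that case.

```python
def l_value(log_ratios, t_values, midpoints):
    """Weighted-voting score: sum of t_i * (x_i - m_i) over the classifier genes."""
    return sum(t * (x - m) for x, t, m in zip(log_ratios, t_values, midpoints))


def classify(l):
    """Map the sign of L to a phenotype label."""
    if l > 0:
        return "first phenotype label"
    if l < 0:
        return "second phenotype label"
    return "undetermined"  # assumption: ties are not defined in the text


# t-values and midpoints for LIMK1 and GALK1 as listed in Table 2.
t_values = [-7.7122, 5.6456]
midpoints = [-0.433, 1.15]
# Hypothetical log-ratios of the same two genes in a new sample.
new_sample = [0.0, 1.0]

score = l_value(new_sample, t_values, midpoints)
label = classify(score)
```

In the full model the sum runs over all 153 genes rather than the two shown here.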

d) EpCAM expression and its in vitro inhibition 
[0250] The expression of EpCAM was assessed by semi-quantitative PCR. Total RNA was
reverse-transcribed to produce single-stranded cDNA using random primers (Promega) with
Superscript II reverse transcriptase (Invitrogen) according to the manufacturer's protocol.
PCR amplification was performed with QuantumRNA 18S Internal Standards (Ambion) by using
HotStarTaq DNA polymerase (Qiagen) according to the manufacturer's protocol. The primer
sequences were as follows: forward, 5'-TGC CGC AGC TCA GGA AGA ATG TGT-3' (SEQ
ID NO:6); reverse, 5'-CAT CAT TCT GAG TTT TTT GAG AAG-3' (SEQ ID NO:7).

[0251] siRNA was used to inhibit EpCAM expression. siRNAs were synthesized by
Qiagen. The sense and antisense strands of the EpCAM siRNA were: sense, 5'-GUU UGC GGA CUG
CAC UUC AdTdT-3' (SEQ ID NO:8); antisense, 5'-UGA AGU GCA GUC CGC AAA
CdTdT-3' (SEQ ID NO:9). Non-silencing RNA was purchased from Qiagen and used as
control siRNA. The sequences of the control siRNA were: sense, 5'-UUC UCC GAA CGU
GUC ACG UdTdT-3' (SEQ ID NO:10); antisense, 5'-ACG UGA CAC GUU CGG AGA
AdTdT-3' (SEQ ID NO:11). Transfection of siRNAs was carried out using TransIT-TKO
transfection reagent (Mirus) according to the manufacturer's protocol and 200 nM siRNA
duplex per experiment. Cell growth was determined by using Cell Counting Kit-8 (Dojindo
Molecular Tech.) as described by the manufacturer. The experiments were performed in
triplicate.
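
As a consistency check on the sequences listed above, the short sketch below confirms that each antisense strand is the reverse complement of its sense strand once the dTdT overhang is set aside. The helper names are hypothetical.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def core(strand):
    """Remove spaces and the 3' dTdT overhang from a strand written 5'->3'."""
    s = strand.replace(" ", "")
    return s[:-4] if s.endswith("dTdT") else s


def reverse_complement_rna(seq):
    """Reverse complement of an RNA sequence made of A, C, G and U."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


# Sequences as given for SEQ ID NO:8-11.
epcam_sense = core("GUU UGC GGA CUG CAC UUC AdTdT")
epcam_antisense = core("UGA AGU GCA GUC CGC AAA CdTdT")
control_sense = core("UUC UCC GAA CGU GUC ACG UdTdT")
control_antisense = core("ACG UGA CAC GUU CGG AGA AdTdT")

epcam_ok = reverse_complement_rna(epcam_sense) == epcam_antisense
control_ok = reverse_complement_rna(control_sense) == control_antisense
```

Both checks evaluate to True, so each pair forms a 19-base-pair duplex with two-nucleotide 3' overhangs.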

2. Results 

[0252] Gene expression profiles of liver samples from 59 chronic liver disease (CLD)
patients and of 14 HCC samples were compared to that of a pool of 8 disease-free normal
liver samples by microarray containing 9128 human cDNA clones (Ye et al., supra). The
CLD samples included 7 hepatitis B (HBV), 11 hepatitis C (HCV), 3 hemochromatosis
(HHC), 5 Wilson's Disease (WD), 10 alcoholic liver disease (ALD), 16 primary biliary
cirrhosis (PBC) and 7 autoimmune hepatitis (AIH). A supervised univariate F-test algorithm
with 2000 random permutations of the class labels was used to search for genes that can
discriminate these 7 CLD groups. This analysis yielded a total of 489 significant genes
(p<0.0005). Hierarchical clustering analysis (as described by Eisen et al., supra) of the 489
genes revealed that these 7 liver disease groups were separated into two major branches, one
consisting mostly of HBV, HCV, HHC, and WD samples and the other containing mainly PBC,
ALD, and AIH samples. These results indicate that HBV, HCV, HHC, and WD are more
closely related to each other than they are as a group to PBC, ALD, or AIH. The segregation
of these samples by a molecular signature specifically reflecting their etiologies was
correlated coincidentally with their risk to develop HCC, with the exception of WD samples
(data not shown). To further determine the degree of difference among these groups, a
t-test-based compound covariate predictor analysis was performed among these 7 groups with
"leave-one-out" cross-validation and 2000 random permutation tests. A total of 21
simulations were performed, which yielded 500 composite genes. The result of the
hierarchical clustering of these genes is consistent with that of the F-test (data not
shown). Consistently, PBC, ALD, or AIH was more significantly different from HBV, HCV, HHC,
or WD, while the differences among the etiologies were less significant (data not shown). It
appears that the WD samples, at least for this set, belong to the high-risk group. The
interpretation of the above results is that the molecular signature is dominated by the
genes segregating the high risk group from the low risk group for their ability to develop
HCC, while the contribution of genes reflecting their individual etiologies was minuscule.

[0253] The genes that were commonly disregulated in HBV/HCV/HHC/WD samples but
not in ALD/PBC/AIH samples were hypothesized to be more closely related to the molecular
signature of HCC. To search globally for such a gene set, the k-nearest neighbors (k=3)
(3NN) and support vector machine (SVM) algorithms were applied to the high-risk
(HBV/HCV/HHC/WD) and low-risk (ALD/PBC/AIH) groups, with a "leave-one-out"
cross-validation test and a test of 2000 random permutations of the class labels, at a
P value <0.001, a computation strategy similar to our recent study (Ye et al., supra). This
analysis yielded a composite classifier containing 556 significant genes, which separated
these two groups very well. It provided a significant class prediction among these groups,
with an overall accuracy of 78% by 3NN and 86% by SVM, and the cross-validated
misclassification rates were significantly lower than expected by chance (p<0.0005) (data not
shown). In contrast, random grouping of these samples yielded statistically insignificant
classification (data not shown).
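
The combination of 3NN classification, "leave-one-out" cross-validation, and a permutation test of the class labels can be sketched as below. This is a simplified sketch under stated assumptions: it uses squared Euclidean distance, omits the gene-selection step (which in practice would be repeated inside each cross-validation loop), and estimates the permutation p-value as (hits + 1) / (permutations + 1). All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def knn_predict(train_x, train_y, x, k=3):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_errors(xs, ys, k=3):
    """Leave each sample out in turn, predict it from the rest, count the errors."""
    errors = 0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        if knn_predict(rest_x, rest_y, xs[i], k) != ys[i]:
            errors += 1
    return errors


def permutation_p_value(xs, ys, n_perm=2000, k=3, seed=0):
    """Fraction of label permutations whose cross-validated error count is
    no worse than the one observed with the true labels."""
    rng = random.Random(seed)
    observed = loocv_errors(xs, ys, k)
    hits = 0
    for _ in range(n_perm):
        shuffled = ys[:]
        rng.shuffle(shuffled)
        if loocv_errors(xs, shuffled, k) <= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: two well-separated groups of six samples, two features each.
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4], [0.2, 0.2],
      [3.0, 3.1], [3.2, 2.9], [2.9, 3.3], [3.1, 3.0], [3.3, 3.2], [3.0, 2.8]]
ys = ["high"] * 6 + ["low"] * 6
```

Calling `permutation_p_value(xs, ys)` returns the observed number of misclassified samples together with the estimated p-value; a small p-value indicates that the misclassification rate is lower than expected by chance.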

[0254] It was noted that many genes in the 556-gene set were also found in the 14 HCC
samples analyzed (data not shown). To identify genes that were commonly disregulated in
the high-risk group and in HCC, the 14 HCC samples were pooled together with the high-risk
group and then compared with the low-risk group using the 3NN algorithm at a P value <0.001,
with 2000 random permutations. This analysis yielded 416 genes, of which 273 were also
found in the 556-gene set (49% overlap). These results indicate that about half of the
signature genes that can discriminate between the high-risk and the low-risk groups are
present in HCC samples. To determine if the 273-gene set (Table 5) was a common signature
for tumors, we applied this set to two independent HCC gene expression profiles using the
3NN and SVM predictors. One set included 24 HCC samples derived from a comparison
with the same normal liver control used above, and the other set included 50 HCC samples
that were compared to their matched non-cancerous liver tissues (Ye et al., supra). The
273-gene signature provided an improved fit by SVM in their classification, with an overall
accuracy of 92% for the 24 HCC samples and 94% for the 50 HCC samples (data not shown),
an improvement in overall performance compared to the 556-gene set. Consistently,
the non-overlapping 283-gene set did not provide any satisfactory performance. Because
most of the HCC-associated genes in the non-overlapping gene set were eliminated, most of
the 283 genes may belong to the signatures separating the etiologies. Moreover, the 383
overlapping genes selected from a comparison of HBV/HCV/HHC/WD and
ALD/PBC/AIH/HCC did not yield a meaningful classification of the two independent HCC
sets, with an overall predictive rate below 50% (a random event). The 273 genes were
examined in multiple liver samples taken from two HBV patients, from different parts of
the liver spread over a region at least 5 cm in diameter. The profiles of these 273 genes
in different parts of the livers from these two patients were almost identical (data not shown).
Furthermore, the top 25 genes with the lowest parametric p-values (p<0.000001) were selected
from the 273-gene set. This set gave a result comparable to that of the 273-gene set (data not
shown). Taken together, these results indicate that the 273-gene set contains most of the
HCC-associated genes relevant to HCC development and that these genes are widely spread
in the parenchyma of the affected livers rather than being retained locally.
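
The overlap arithmetic behind the 556-, 416-, 273-, and 283-gene sets can be checked with a short sketch. The gene identifiers below are hypothetical placeholders constructed only so that the set sizes match those reported above; they do not correspond to actual genes.

```python
def overlap_summary(set_a, set_b):
    """Count genes shared by two gene sets and genes found only in the first set."""
    shared = set_a & set_b
    return {
        "shared": len(shared),
        "only_a": len(set_a - set_b),
        "fraction_of_a": len(shared) / len(set_a),
    }


# Hypothetical identifiers sized to match the text: 556 genes in the first
# classifier, 416 in the second, with 273 genes in common.
set_556 = {f"g{i}" for i in range(556)}
set_416 = {f"g{i}" for i in range(273)} | {f"h{i}" for i in range(143)}

summary = overlap_summary(set_556, set_416)
# 273 shared genes, 283 genes unique to the 556-gene set, about 49% overlap.
```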

[0255] To examine whether the 273-gene set is a common signature in other human tumors, the
gene parameters in this signature were applied using SVM to 98 HCCs, 53 lung cancers, 89
gastric adenocarcinomas, 37 soft tissue tumors, 39 breast tumors, and 23 diffuse large B-cell
lymphomas (DLBCL) from several publicly available microarray datasets (Alizadeh et al.,
supra; Perou et al., supra; Garber et al., Proc. Natl. Acad. Sci. U.S.A. 98:13784-13789, 2001).
The 273-gene set consistently performed well with the additional 98 HCC samples (80% of
the samples fit the signature), and 97% of the breast cancers (39 cases) and 78% of the DLBCL
cases shared similar signatures. In contrast, most of the tumor samples from lung, soft tissues,
and stomach showed a very poor fit to this signature (between 6% and 30% of the cases) (data
not shown). As a control, the 283-gene set (non-HCC-related genes) did not provide a
satisfactory prediction for these samples. Thus, the HCC-associated genes in the classifier
appear to be commonly disregulated in breast cancer and DLBCL, but not in lung
adenocarcinoma, soft tissue tumors, or gastric adenocarcinoma.

[0256] The above studies suggested that genes responsible for the genesis of HCC may be
present in the 273-gene set. For example, a gene whose expression is significantly elevated
in the high-risk group but not in the low-risk group may act as an oncogene to promote cell
growth. To test this "proof-of-principle" hypothesis, a lead gene at the top of the 273-gene
list was selected. This gene was identified as EpCAM, or tumor-associated calcium signal
transducer 1 (TACSTD1, Hs.692), with an average 3.6-fold increase in expression in the
high-risk group but only a 1.7-fold increase in the low-risk group (Fig 6a), as well as
increased expression in HCC (data not shown). Elevated expression of EpCAM in the high-risk
CLD samples was verified by quantitative RT-PCR analysis (Fig 6b). The expression of EpCAM
in various HCC cell lines was examined by Western blot analysis. EpCAM is highly expressed
in Hep3B cells, but the expression level is relatively low in Huh1 and Huh4 cells (Fig 6c),
generally correlating with their growth rates (Fig 6d). Furthermore, inhibition of EpCAM
expression by two different siRNA oligos specific to EpCAM resulted in significant growth
inhibition of Hep3B cells (Fig 6f). In contrast, a control siRNA oligo had no such effect
(Fig 6e and data not shown). These results indicate that EpCAM may confer an oncogenic
property by promoting neoplastic cell proliferation.
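
Fold changes such as those quoted above are summarized per class as a geometric mean of expression ratios, as in the "Geom mean of ratios in class" columns of the gene tables. A minimal sketch of that calculation follows, assuming strictly positive expression ratios relative to a reference sample; the function names and the example ratios are hypothetical.

```python
from math import exp, log


def geometric_mean(ratios):
    """Geometric mean of positive expression ratios (mean taken on the log scale)."""
    return exp(sum(log(r) for r in ratios) / len(ratios))


def fold_difference(high_ratios, low_ratios):
    """Ratio of the class geometric means (high-risk class over low-risk class)."""
    return geometric_mean(high_ratios) / geometric_mean(low_ratios)


# Hypothetical per-sample ratios for a single gene in each class.
high_risk = [2.0, 8.0]   # geometric mean 4.0
low_risk = [1.0, 4.0]    # geometric mean 2.0
fold = fold_difference(high_risk, low_risk)
```

Averaging on the log scale keeps a 2-fold increase and a 2-fold decrease symmetric around 1, which an arithmetic mean of ratios would not.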

[0257] The 273 significant genes, their gene symbols, their map positions, and their UG
Cluster identifiers are presented in Table 5.

Map | Gene symbol | UG cluster | Description | Unique id | Geom mean of ratios in class 2: Low | Geom mean of ratios in class 1: High | Parametric p-value | t-value

 | DYRK3 | Hs.38018 | dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 3 | 160233 | 1.133 | 0.616 | p < 0.000001 |
 |  | Hs.406646 | Similar to hypothetical protein PRO2831 [Homo sapiens], mRNA sequence | 160436 | 1.071 | 0.786 | p < 0.000001 |
 | HLF | Hs.433707 | hepatic leukemia factor | 160795 | 1.382 | 0.761 | p < 0.000001 |
 | C9 | Hs.1290 | complement component 9 | 161944 | 0.798 | 0.314 | p < 0.000001 |
 | ABCA1 | Hs.211562 | ATP-binding cassette, sub-family A (ABC1), member 1 | 167718 | 0.703 | 0.506 | p < 0.000001 |
 | KIAA0843 | Hs.26777 | KIAA0843 protein | 168437 | 0.912 | 0.65 | p < 0.000001 |
 | IPLA2(GAMMA) | Hs.44198 | intracellular membrane-associated calcium-independent phospholipase A2 gamma | 162884 | 1.087 | 0.843 | p < 0.000001 |
 | SIPL | Hs.64322 | SIPL protein | 166910 | 1.065 | 0.657 | p < 0.000001 |
 |  | Hs.36102 | ESTs, Highly similar to MT1B HUMAN METALLOTHIONEIN-IB (MT-1B) [H.sapiens] | 166192 | 1.003 | 0.544 |  |
 | NAT2 | Hs.2 | N-acetyltransferase 2 (arylamine N-acetyltransferase) | 164779 | 0.832 | 0.46 |  |
 | CD5L | Hs.52002 | CD5 antigen-like | 166252 | 1.191 | 0.707 |  |

 | FLJ12666 | Hs.23767 | hypothetical protein FLJ12666 |  |  |  |  |
 | DKFZP564D0462 | Hs.44197 | hypothetical protein DKFZp564D0462 |  |  |  |  |
 | IMPA1 | Hs.171776 | inositol(myo)-1(or 4)-monophosphatase 1 |  |  |  |  |
 |  | Hs.422650 | ESTs, Weakly similar to ARF protein [Homo sapiens] [H. sapiens] |  |  |  |  |
 | CRHBP | Hs.115617 | corticotropin releasing hormone binding protein |  |  |  |  |
 | PZP | Hs.74094 | pregnancy-zone protein |  |  |  |  |
 | SRP54 | Hs.49346 | signal recognition particle 54kDa |  |  |  |  |
 | INPP5D | Hs.155939 | inositol polyphosphate-5-phosphatase, 145kDa |  |  |  |  |
 | NEDD4 | Hs.1565 | neural precursor cell expressed, developmentally down-regulated 4 |  |  |  |  |
 | NDST1 | Hs.20894 | N-deacetylase/N-sulfotransferase (heparan glucosaminyl) 1 |  |  |  |  |
 | KANK | Hs.77546 | kidney ankyrin repeat-containing protein |  |  |  |  |

 |  | Hs.69771 | B-factor, properdin | 168256 | 0.865 |  |  |
 | ANG | Hs.332764 | angiogenin, ribonuclease, RNase A family, 5 | 162472 | 1.011 |  |  |
 | NAT1 | Hs.155956 | N-acetyltransferase 1 (arylamine N-acetyltransferase) | 167629 | 0.906 |  |  |
 | DO | Hs.13776 | Dombrock blood group | 162036 | 1.231 |  |  |
 | PBEF | Hs.239138 | pre-B-cell colony-enhancing factor | 159972 | 0.831 |  |  |
 | GUSB | Hs.183868 | glucuronidase, beta | 160759 | 1.14 |  |  |
1p31 | ACADM | Hs.79158 | acyl-Coenzyme A dehydrogenase, C-4 to C-12 straight chain | 162192 | 1.284 |  |  |
 |  | Hs.23729 | Homo sapiens clone 24405 mRNA sequence | 161636 | 1.062 |  |  |
14q24 | MTHFD1 | Hs.172665 | methylenetetrahydrofolate dehydrogenase (NADP+ dependent), methenyltetrahydrofolate cyclohydrolase, formyltetrahydrofolate synthetase | 168452 | 1.211 |  |  |
 | RNASE4 | Hs.283749 | ribonuclease, RNase A family, 4 | 165666 | 0.906 |  |  |
3q26.1- | BCHE | Hs.1327 | butyrylcholinesterase | 167394 | 0.939 |  |  |

 | PCCA | Hs.80741 | propionyl Coenzyme A carboxylase, alpha polypeptide | 167501 | 0.767 | 0.62 | 6.80E-05 |
 | IGFBP1 | Hs.102122 | insulin-like growth factor binding protein 1 | 165974 | 2.181 | 0.809 |  |
 | PKP2 | Hs.25051 | plakophilin 2 | 161234 | 0.933 | 0.622 |  |
 | PCTP | Hs.285218 | phosphatidylcholine transfer protein | 166532 | 1.098 | 0.852 |  |
 | ADK | Hs.432422 | adenosine kinase | 167750 | 0.815 | 0.567 |  |
 | FGB | Hs.7645 | fibrinogen, B beta polypeptide | 165890 | 0.766 | 0.479 |  |
 | TDO2 | Hs.183671 | tryptophan 2,3-dioxygenase | 161362 | 0.89 | 0.406 |  |
 | ANXA7 | Hs.386741 | annexin A7 | 159764 | 1.044 | 0.739 |  |
 | ACMSD | Hs.114088 | aminocarboxymuconate semialdehyde decarboxylase | 164249 | 0.88 | 0.642 |  |
 | MFN2 | Hs.3363 | mitofusin 2 | 162711 | 1.142 | 0.91 |  |
 | SGK | Hs.296323 | serum/glucocorticoid regulated kinase | 160370 | 1.391 | 0.784 |  |
 | RODH | Hs.11958 | 3-hydroxysteroid epimerase | 161146 | 0.867 | 0.483 |  |

 | TRA1 | Hs.82689 | tumor rejection antigen (gp96) 1 | 161986 | 0.846 | 0.476 |  |
 | TLR2 | Hs.63668 | toll-like receptor 2 | 165670 | 1.049 | 0.807 |  |
 | KIAA0212 | Hs.154332 | KIAA0212 gene product | 166820 | 0.78 |  |  |
 |  | Hs.234898 | Homo sapiens, clone IMAGE:3833472, mRNA, mRNA sequence | 164495 | 0.838 |  |  |
 | FGL1 | Hs.107 | fibrinogen-like 1 | 163893 | 0.592 |  |  |
 | CYB5 | Hs.83834 | cytochrome b-5 | 167287 | 1.058 |  |  |
 | ETFDH | Hs.323468 | electron-transferring-flavoprotein dehydrogenase | 162446 | 1.015 |  |  |
 | CYP2C9 | Hs.167529 | cytochrome P450, subfamily IIC (mephenytoin 4-hydroxylase), polypeptide | 169375 | 1.102 |  |  |
 | SORD | Hs.878 | sorbitol dehydrogenase | 160720 | 0.963 |  |  |
 | SF3B1 | Hs.334826 | splicing factor 3b, subunit 1, 155kDa | 162067 | 1.266 |  |  |
 |  | Hs.284252 | Homo sapiens mRNA; cDNA DKFZp762O1615 (from clone DKFZp762O1615), mRNA sequence | 164393 | 0.936 |  |  |

16q22.1 | TAT | Hs.161640 | tyrosine aminotransferase | 166007 | 0.896 | 0.418 | 0.000205 |
 | MERTK | Hs.306178 | c-mer proto-oncogene tyrosine kinase | 165559 | 1.188 | 0.828 | 0.000219 |
2q33 | BZW1 | Hs.155291 | basic leucine zipper and W2 domains 1 | 165133 | 1.224 | 0.816 | 0.000221 |
 | KIAA0062 | Hs.89868 | KIAA0062 protein | 167542 | 0.522 | 0.334 | 0.000223 |
 |  | 166337 (IncytePD) | arginase, liver | 169449 | 0.902 | 0.504 | 0.00023 |
 | (illegible) | Hs.79345 | coagulation factor VIII, procoagulant component (hemophilia A) | 167543 | 0.78 | 0.649 | 0.000231 |
 | CDW92 | Hs.179902 | CDw92 antigen | 163368 | 0.61 | 0.491 | 0.000235 |
13q12.2 | HSP105B | Hs.36927 | heat shock 105kD | 168931 | 1.761 | 1.059 | 0.000244 |
 | ORM1 | Hs.572 | orosomucoid 1 | 165009 | 0.687 | 0.406 | 0.000245 |
1p32 | C8A | Hs.93210 | complement component 8, alpha polypeptide | 162162 | 0.662 | 0.37 | 0.000264 |
 | DECR1 | Hs.81548 | 2,4-dienoyl CoA reductase 1, mitochondrial | 166110 | 1.159 | 0.746 | 0.000265 |
 | GHR | Hs.125180 | growth hormone receptor | 161689 | 0.985 | 0.749 | 0.000277 |
 | SEPP1 | Hs.275775 | selenoprotein P, plasma, | 167617 | 1.223 | 0.899 | 0.000282 |
 | CYP4F3 | Hs.106242 | cytochrome P450, subfamily IVF, polypeptide 3 (leukotriene B4 omega hydroxylase) | 161484 | 0.938 | 0.644 | 0.000291 |

6q23.2 | MAP7 | Hs.146388 | microtubule-associated protein 7 | 167551 | 1.172 | 0.91 | 0.000298 |
 | PGM1 | Hs.1869 | phosphoglucomutase 1 | 169703 | 0.895 | 0.604 | 0.000299 |
 |  | 2593385 (IncytePD) | Incyte EST | 163040 | 0.909 | 0.673 | (illegible) |
 |  | 1550727 (IncytePD) | L-3-hydroxyacyl-Coenzyme A dehydrogenase, short chain | 165566 | 0.807 | 0.602 | 0.000311 |
 |  | Hs.306359 | Homo sapiens clone 25038 mRNA sequence | 162707 | 1.192 | 1.017 | 0.000322 |
 | PACE4 | Hs.170414 | paired basic amino acid cleaving system 4 | 166674 | 0.652 | 0.423 | 0.000322 |
 | FABP1 | Hs.380135 | fatty acid binding protein 1, liver | 165737 | 1.378 | 0.732 | 0.000327 |
 | SCP2 | Hs.75760 | sterol carrier protein 2 | 168366 | 0.86 | 0.596 | 0.000334 |
 | ACO1 | Hs.154721 | aconitase 1, soluble | 165115 | 1.044 | 0.809 | 0.000334 |
 | PLXNB1 | Hs.278311 | plexin B1 | 161732 | 1.152 | 0.718 | 0.000389 |
 | TF | Hs.396489 | transferrin | 162202 | 1.28 | 0.854 | 0.000349 |
 | HSD17B4 | Hs.75441 | hydroxysteroid (17-beta) dehydrogenase 4 | 167991 | 0.886 | 0.553 | 0.000361 |

 | PGRMC1 | Hs.90061 | progesterone receptor membrane component 1 | 169717 | 0.953 | 0.662 | 0.000365 |
 | SLC27A2 | Hs.11729 | solute carrier family 27 (fatty acid transporter), member 2 | 165457 | (illegible) | 0.554 | 0.000367 |
 | CAT | Hs.395771 | catalase | 164532 | 1.101 | 0.687 |  |
 | LCMT | Hs.8054 | leucine carboxyl methyltransferase | 162934 | 1.28 | 0.969 |  |
13q14.3 | LCP1 | Hs.381099 | lymphocyte cytosolic protein 1 (L-plastin) | 160051 | 0.822 | 0.583 |  |
 | HADHB | Hs.146812 | hydroxyacyl-Coenzyme A dehydrogenase/3-ketoacyl-Coenzyme A thiolase/enoyl-Coenzyme A hydratase (trifunctional protein), beta subunit | 168394 | 0.97 | 0.701 |  |
 |  | Hs.426542 | EST | 162323 | 1.164 | 0.964 |  |
 | UK114 | Hs.18426 | translational inhibitor protein p14.5 | 160471 | 1.067 | 0.689 |  |
 | DC2 | Hs.103180 | DC2 protein | 163224 | 0.823 | 0.624 |  |
10p12 | CACNB2 | Hs.30941 | calcium channel, voltage-dependent, beta 2 subunit | 162773 | 1.308 | 0.998 |  |
 | IL18R1 | Hs.159301 | interleukin 18 receptor 1 | 166579 |  | 0.88 |  |
Xq22.2 | SERPINA7 | Hs.76838 | serine (or cysteine) proteinase inhibitor, clade A (alpha-1 antiproteinase, antitrypsin), member 7 | 161872 | 1.113 | 0.665 |  |

 | LPA | Hs.119520 | lipoprotein, Lp(a) | 162012 | 1.159 | 0.659 | 0.000467 |
 | HPS3 | Hs.282804 | Hermansky-Pudlak syndrome 3 | 163509 | 1.179 | 0.859 | 0.000469 |
7q11.21 | TPST1 | Hs.421194 | tyrosylprotein sulfotransferase 1 | 165011 | 0.732 | 0.523 | 0.000532 |
 | KIAA1450 | Hs.83243 | KIAA1450 protein | 164314 | 0.875 | 0.649 | 0.000577 |
 | RAB3IL1 | Hs.13759 | RAB3A interacting protein (rabin3)-like 1 | 162882 | 1.054 | 0.935 | 0.000582 |
 | CYP2J2 | Hs.152096 | cytochrome P450, subfamily IIJ (arachidonic acid epoxygenase) polypeptide 2 | 165530 | 1.162 | 0.769 | 0.000636 |
 | POU1F1 | Hs.89394 | POU domain, class 1, transcription factor 1 (Pit1, growth hormone factor 1) | 166057 | 0.924 | 0.487 | 0.000679 |
 | GTF2B | Hs.258561 | general transcription factor IIB | 167868 | 1.228 | 0.95 | 0.000703 |
 | GTF2E2 | Hs.77100 | general transcription factor IIE, polypeptide 2, beta 34kDa | 167779 | 1.096 | 0.942 | 0.000706 |
 | RAB9P40 | Hs.19012 | Rab9 effector p40 | 165329 | 1.225 | 0.947 | 0.000727 |
 | PLG | Hs.75576 | plasminogen | 166857 |  | 0.62 | 0.000735 |

 | KCNJ8 | Hs.102308 | potassium inwardly-rectifying channel, subfamily J, member 8 | 165788 | 1.215 | 0.838 | 0.000775 |
 |  | 604856 (IncytePD) | nicotinamide N-methyltransferase | 167386 | 1.001 | 0.662 | 0.000778 |
 | FLJ21918 | Hs.282093 | hypothetical protein FLJ21918 | 163088 | 0.802 | 0.671 | 0.000795 |
 | ETFA | Hs.169919 | electron-transfer-flavoprotein, alpha polypeptide (glutaric aciduria II) | 167385 | 1.166 | 0.795 | 0.00079 |
 | SAT | Hs.28491 | spermidine/spermine N1-acetyltransferase | 169569 | 1.459 | 1.068 | 0.000799 |
 | RREB1 | Hs.171942 | ras responsive element binding protein 1 | 160982 | 1.362 | 1.04 | 0.000812 |
 | TMOD | Hs.374849 | tropomodulin | 166818 | 0.967 | 0.756 | 0.00083 |
 |  | Hs.32699 | Similar to RIKEN cDNA 1810013D05 gene [Homo sapiens], mRNA sequence | 164368 | 1.057 | 0.79 | 0.000844 |
 | SORD | Hs.878 | sorbitol dehydrogenase | 160667 | 1.001 | 0.609 | 0.000848 |
 | LOC57149 | Hs.28607 | hypothetical protein A-211C6.1 | 160956 | 0.894 | 0.713 | 0.000851 |
 | PCK2 | Hs.75812 | phosphoenolpyruvate carboxykinase 2 (mitochondrial) | 166778 | 0.933 | 0.625 | 0.000858 |

17q21.1 | WIRE | Hs.13996 | WIRE protein | 165413 | 0.953 |  | 0.000822 | 3.53
19q13.42 | EPS8R1 | Hs.28907 | epidermal growth factor receptor pathway substrate 8-related protein | 166348 | 0.941 | 1.08 | 0.000759 | 3.56
 |  | Hs.119629 | ESTs, Moderately similar to hypothetical protein FLJ20234 [Homo sapiens] [H.sapiens] | 163115 | (illegible) | 1.048 | 0.000725 | 3.57
 |  | Hs.194441 | ESTs | 163579 | 0.963 | 1.124 | 0.000684 | 3.59
 | KIAA1641 | Hs.44566 | KIAA1641 protein | 161090 | 1.306 | 1.835 | 0.00068 |
11q13-q14 | PAK1 | Hs.64056 | p21/Cdc42/Rac1-activated kinase 1 (STE20 homolog, yeast) | 161354 | 0.981 | 1.173 | 0.000688 |
 |  | Hs.142907 | Human BRCA2 region, mRNA sequence CG011 | 162677 | 0.82 | 0.947 | 0.000661 |
 | POLD1 | Hs.99890 | polymerase (DNA directed), delta 1, catalytic subunit 125kDa | 161085 | 0.653 | 0.853 | 0.000582 |
 | H2AFA | Hs.121017 | H2A histone family, member A | 161518 | 0.985 | 1.141 | 0.000572 |
 | MRPL43 | Hs.151945 | mitochondrial ribosomal protein L43 | 163109 | 1.062 | 1.232 | 0.000571 |
9q22.33 | TXNDC4 | Hs.154023 | thioredoxin domain containing 4 (endoplasmic reticulum) | 164845 | 0.841 | 1.014 | 0.000537 |

7q21-q22 | AKAP9 | Hs.58103 | A kinase (PRKA) anchor protein (yotiao) 9 | 162564 | 0.528 | 0.677 | 0.000538 | 3.67
 |  | Hs.125038 | ESTs | 164727 | 1.059 | 1.301 | 0.000522 | 3.68
 |  | Hs.278483 | H4 histone family, member A [Homo sapiens], mRNA sequence | 161620 | 0.781 | 0.898 | 0.00052 | 3.68
19q13.3 | 20D7-FC4 | Hs.128702 | hypothetical protein 20D7-FC4 | 161334 | 0.854 | 0.976 | 0.000484 | (illegible)
 | TOB2 | Hs.4994 | transducer of ERBB2, 2 | 163536 | 0.871 | 1.081 | 0.000428 | (illegible)
7q11.23 | CLDN4 | Hs.5372 | claudin 4 | 162152 | 1.018 | 1.376 | 0.000411 | 3.77
 |  | Hs.143992 | ESTs, Moderately similar to hypothetical protein FLJ20378 [Homo sapiens] [H. sapiens] | 169742 | 0.936 | 1.138 | 0.00033 | 3.83
11q13 | MEN1 | Hs.423348 | multiple endocrine neoplasia I | 161058 | 1.035 | 1.234 | 0.000311 | 3.84
18p11.21 | KIAA0874 | Hs.27973 | KIAA0874 protein | 161813 | 0.619 | 0.765 | 0.000311 | 3.84
2p22-p21 | MSH2 | Hs.78934 | mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli) | 168511 | (illegible) | 1.227 | 0.000311 | 3.84
 |  | 3031912 (IncytePD) | Incyte EST | 161873 | 1.031 | 1.265 | 0.000309 | 3.84
19q13.33 | NUP62 | Hs.9877 | nucleoporin 62kDa | 169310 | 1.002 | 1.174 | 0.000263 | 3.89

11p11.2 | MDK | Hs.82045 | midkine (neurite growth-promoting factor 2) | 168933 | 1.521 | 2.07 | 0.000259 | (illegible)
 | FLJ11280 | Hs.3346 | hypothetical protein FLJ11280 | 163495 | 0.893 | 1.058 | 0.000213 | 3.96
 |  | Hs.82845 | Homo sapiens cDNA: FLJ21930 fis, clone HEP04301, highly similar to HSU90916 Human clone 23815 mRNA sequence | 168500 | 0.557 | 0.749 | 0.000209 | 3.96
5p15.33 | TRIP13 | Hs.6566 | thyroid hormone receptor interactor 13 | 168246 | 0.984 | 1.258 | 0.000207 | 3.96
 |  | Hs.384561 | Homo sapiens full length insert cDNA clone ZC18H06, mRNA sequence | 164713 | 0.914 | 1.087 | 0.000199 | 3.98
1q32.2 | ELF3 | Hs.166096 | E74-like factor 3 (ets domain transcription factor, epithelial-specific) | 169559 | 0.862 | 1.324 | 0.000168 | 4.03
7p15 | MPP6 | Hs.108931 | membrane protein, palmitoylated 6 (MAGUK p55 subfamily member 6) | 164262 | 0.906 | 1.138 | 0.000162 | 4.04
 | FLJ10520 | Hs.77510 | hypothetical protein FLJ10520 | 161661 | 0.998 | 1.187 | 0.000164 | 4.04
 |  | Hs.172129 | Homo sapiens cDNA: FLJ21409 fis, clone COL03924, mRNA sequence | 163071 | 0.926 | 1.15 | 0.000159 | 4.05

17q25.2 | KIAA0195 | Hs.301132 | KIAA0195 gene product | 165465 | 0.922 | 1.058 | 0.000126 | 4.12
 |  | Hs.107845 | ESTs | 164085 | 1.023 | 1.265 | 0.000117 | 4.14
 | FLJ11362 | Hs.8929 | hypothetical protein FLJ11362 | 166229 | 1.049 | 1.301 | 0.000119 | 4.14
 | HD | Hs.79391 | huntingtin (Huntington disease) | 166228 | 0.872 | 1.027 | 0.000102 | 4.18
 | NRGN | Hs.26944 | neurogranin (protein kinase C substrate, RC3) | 169583 | 0.427 | 0.614 | 9.10E-05 |
 | CLDN4 | Hs.5372 | claudin 4 | 160913 | 1.01 | 1.37 | 6.80E-05 |
2q23.3 | FNBP3 | Hs.107213 | formin binding protein 3 | 168965 | 0.844 | 1.063 | 6.60E-05 |
 |  | 1510581 (IncytePD) | p53-responsive gene 5 | 166849 | (illegible) | 1.154 | 5.80E-05 |
17q11.1 | KIAA1361 | Hs.15119 | KIAA1361 protein | 167919 | 0.816 | 1.021 | 5.30E-05 |
 |  | Hs.279482 | ESTs | 166837 | 0.977 | 1.219 | 4.00E-05 |
 |  | Hs.340316 | Homo sapiens cDNA FLJ34031 fis, clone FCBBF2003895, mRNA sequence | 168977 | 0.974 | 1.278 | 4.10E-05 |
 | FLJ39514 | Hs.48565 | hypothetical protein FLJ39514 | 166408 | 0.93 | 1.06 | 3.50E-05 |
3q26.3 | PRKCI | Hs.1904 | protein kinase C, iota | 167009 | 0.952 | 1.233 | 3.50E-05 |
 | SNRPA | Hs.173255 | small nuclear ribonucleoprotein polypeptide A | 168029 | 1.011 | 1.237 | 2.40E-05 |

12p12.1 | KRAS2 | Hs.433714 | v-Ki-ras2 Kirsten rat sarcoma 2 viral oncogene homolog | 169587 | 0.632 | 0.838 | 2.40E-05 | 4.61
15q22.33 | FLJ00005 | Hs.367690 | FLJ00005 protein | 163235 | 0.974 | 1.425 | 2.40E-05 | 4.61
 | LOC57146 | Hs.27191 | hypothetical protein from clone 24796 | 161066 | 0.953 | 1.153 | 2.20E-05 | 4.63
 | PDPK1 | Hs.154729 | 3-phosphoinositide dependent protein kinase-1 | 165515 | 0.784 | 0.967 | 1.90E-05 | 4.67
 | PPP1R12A | Hs.16533 | protein phosphatase 1, regulatory (inhibitor) subunit 12A | 169403 | 0.775 | 1.035 | 1.90E-05 | 4.67
 | DKFZP564K0322 | Hs.97876 | hypothetical protein DKFZp564K0322 | 169490 | 0.951 | 1.14 | 1.60E-05 | 4.71
 | TACSTD1 | Hs.692 | tumor-associated calcium signal transducer 1 | 160089 | 1.727 | (illegible) | (illegible) | 4.71
 | ATP7A | Hs.606 | ATPase, Cu++ transporting, alpha polypeptide (Menkes syndrome) | 169508 | 0.889 | 1.055 | 1.50E-05 | 4.73
 | FLJ22548 | Hs.103267 | hypothetical protein FLJ22548 similar to gene trap PAT 12 | 163214 | 1.137 | 1.478 | 1.20E-05 | (illegible)
 |  | Hs.99398 | ESTs, Weakly similar to KHLX_HUMAN Kelch-like protein X [H.sapiens] | 168509 | 0.89 | 1.12 | 3.00E-06 | 5.16

15q21.2 | FLJ13213 | Hs.331328 | hypothetical protein FLJ13213 | 166434 | 0.668 | 0.855 | 3.00E-06 | 5.17
 |  | 1602194 (IncytePD) | Incyte EST | 161233 | 0.929 | (illegible) | 1.00E-06 | 5.37
13q14.3 | PCDH17 | Hs.106511 | protocadherin 17 | 167498 | 0.963 | (illegible) | p < 0.000001 | 5.55
 |  | Hs.171553 | Homo sapiens clone 24630 mRNA sequence | 160943 | 0.896 | (illegible) | p < 0.000001 | 5.99
 | LOC91875 | Hs.102480 | hypothetical protein BC008647 | 165379 | 1.012 | (illegible) | p < 0.000001 | 6.36
12p11.21 | KIAA1557 | Hs.6185 | KIAA1557 protein | 167992 | 0.962 | (illegible) | p < 0.000001 | 6.36
 | ENC1 | Hs.104925 | ectodermal-neural cortex (with BTB-like domain) | 166068 | 0.824 | (illegible) | p < 0.000001 | 6.37

Cells marked (illegible) could not be read in the source; empty cells have no surviving value.
The following values survive in the source without a recoverable row position: Geom mean of
ratios in class 1 (High): 0.611, 0.593, 0.884, 0.448 (row group beginning with Hs.69771) and
0.639 (row group beginning with TRA1); Parametric p-value: 0.000401, 0.000394, 0.000411,
0.00046 (row group beginning with PGRMC1); Map: p21.1 (row group beginning with WIRE).

[0258] The top 25 genes with the lowest parametric p-values (p<0.000001) were selected
from the 273-gene set and this set gave rise to a comparable result as the 273-gene set. These
25 genes significant for indicating a liver disease patient's risk of developing HCC, their gene
symbols, their map positions, and their UG Cluster identifiers are presented in Table 6. A
further set of 10 significant genes for predicting the risk of developing HCC in a patient
suffering from a severe liver disease has been determined in a similar manner and is
presented in Table 7.

Map | Gene symbol | UG cluster | Description | Unique id | Geom mean of ratios in class 2: Low | Geom mean of ratios in class 1: High | %CV support | Parametric p-value | t-value

 |  |  | [...] long chain |  |  |  |  |  |
 | KIAA0092 | Hs.151791 | KIAA0092 gene product | 163874 | 1.181 | 0.864 |  | 0.0000001 | -5.9
 | CGI-26 | Hs.24332 | CGI-26 protein | 163096 | 0.925 | 0.728 |  | 0.0000001 | -5.88
1q32 | DYRK3 | Hs.38018 | dual-specificity tyrosine-(Y)-phosphorylation regulated kinase 3 | 160233 | 1.133 | 0.616 |  | (illegible) | (illegible)
 |  | Hs.406646 | Similar to hypothetical protein PRO2831 [Homo sapiens], mRNA sequence | 160436 | 1.071 | 0.786 |  |  | -5.67
 | HLF | Hs.433707 | hepatic leukemia factor | 160795 | 1.382 | 0.761 |  |  | (illegible)
 | C9 | Hs.1290 | complement component 9 | 161944 | 0.798 | 0.314 |  |  | (illegible)
9q31.1 | ABCA1 | Hs.211562 | ATP-binding cassette, sub-family A (ABC1), member 1 | 167718 | 0.703 | 0.506 |  |  | -5.6
 | KIAA0843 | Hs.26777 | KIAA0843 protein | 168437 | 0.912 | 0.65 |  |  | -5.58
 | IPLA2(GAMMA) | Hs.44198 | intracellular membrane-associated calcium-independent phospholipase A2 gamma | 162884 | 1.087 | 0.843 |  |  | -5.57
121 



BNSDOCID: <WO 03087766A2_l_> 



WO 03/087766 



PCT/US03/10783 



[The next two table pages are illegible in the source. Legible fragments: the column headers "Map" and "Gene symbol"; map positions 6p21.3-21.2 and 10q24; gene symbols TDO2 and ACAT1.]
WHAT IS CLAIMED IS:

1. A method for identifying potential therapeutic targets for inhibiting metastasis in a patient suffering from hepatocellular carcinoma (HCC), comprising the steps of:
a) contacting an array comprising capture reagents for a set of cellular markers with a sample from a metastatic HCC patient;
b) capturing markers from the sample and generating a first signal;
c) repeating steps a) and b) with a sample from a non-metastatic HCC patient and thereby generating a second signal; and
d) comparing the first and second signals and thereby identifying a subset of cellular markers whose level is different in the first and second signals, wherein the subset of cellular markers are potential therapeutic targets for treating HCC metastasis in an HCC patient.

2. The method of claim 1, wherein a signal generated from a normal non-cancerous sample on an array identical to the array of step a) is subtracted in steps b) and c) to generate the first and second signals.

3. A method for predicting the metastatic potential in a patient suffering from hepatocellular carcinoma (HCC), comprising the steps of:
a) contacting an array comprising capture reagents for a set of cellular markers with a sample from a metastatic HCC patient, the set of cellular markers comprising at least ten genes or proteins encoded by genes independently selected from the genes of Table 2;
b) capturing markers from the sample;
c) generating a first signal from the captured markers of step b);
d) repeating steps a) to c) with a sample from a non-metastatic HCC patient and thereby generating a second signal;
e) repeating steps a) to c) with a sample from an HCC patient with unknown metastatic potential and thereby generating a third signal; and
f) comparing the third signal to the first and the second signals and thereby determining the metastatic potential of the HCC patient of step e).
4. The method of claim 3, wherein the set of cellular markers comprises at least 20 genes or proteins encoded by genes independently selected from the genes of Table 2.

5. The method of claim 4, wherein the set of cellular markers comprises at least 50 genes or proteins encoded by genes independently selected from the genes of Table 2.

6. The method of claim 5, wherein the set of cellular markers comprises at least 100 genes or proteins encoded by genes independently selected from the genes of Table 2.

7. The method of claim 6, wherein the set of cellular markers comprises the genes or proteins encoded by genes of Table 2.

8. The method of claim 3, wherein the set of cellular markers comprises the genes or proteins encoded by genes of Table 4.

9. The method of claim 3, wherein the set of cellular markers comprises the genes or proteins encoded by genes of Unigene numbers Hs.313, Hs.69707, Hs.222, Hs.63984, Hs.75573, Hs.177687, Hs.69707, Hs.222, Hs.323712, and Hs.63984.

10. The method of claim 3, wherein the sample of steps a) and b), the sample of step d), and the sample of step e) are liver tissue extracts.

11. The method of claim 3, wherein the array of step a) is a genomic array.

12. The method of claim 3, wherein the array of step a) is a proteomic array.

13. A method for identifying potential therapeutic targets for preventing hepatocellular carcinoma (HCC) in a patient suffering from a chronic liver disease, comprising the steps of:
a) contacting an array comprising capture reagents for a set of cellular markers with a sample from a patient with a chronic liver disease and a high risk of developing HCC;
b) capturing markers from the sample and generating a first signal;
c) repeating steps a) and b) with a sample from a patient with a chronic liver disease and a low risk of developing HCC and thereby generating a second signal; and
d) comparing the first and second signals and thereby identifying a subset of cellular markers whose level is different in the first and second signals, wherein the subset of cellular markers are potential therapeutic targets for preventing HCC in a patient with a chronic liver disease.

14. The method of claim 13, wherein a signal generated from a normal non-cancerous sample on an array identical to the array of step a) is subtracted in steps b) and c) to generate the first and second signals.

15. A method for predicting the risk of developing hepatocellular carcinoma (HCC) in a patient suffering from a chronic liver disease, comprising the steps of:
a) contacting an array comprising capture reagents for a set of cellular markers with a sample from a patient with a chronic liver disease and a high risk of HCC, the set of cellular markers comprising at least ten genes or proteins encoded by genes independently selected from the genes of Table 5;
b) capturing markers from the sample;
c) generating a first signal from the captured markers of step b);
d) repeating steps a) to c) with a sample from a patient with a chronic liver disease and a low risk of HCC and thereby generating a second signal;
e) repeating steps a) to c) with a sample from a patient with a chronic liver disease and an unknown risk of HCC and thereby generating a third signal; and
f) comparing the third signal to the first and the second signals and thereby determining the risk of developing HCC in the patient of step e).

16. The method of claim 15, wherein the set of cellular markers comprises at least 20 genes or proteins encoded by genes independently selected from the genes of Table 5.

17. The method of claim 16, wherein the set of cellular markers comprises at least 50 genes or proteins encoded by genes independently selected from the genes of Table 5.

18. The method of claim 17, wherein the set of cellular markers comprises at least 100 genes or proteins encoded by genes independently selected from the genes of Table 5.

19. The method of claim 18, wherein the set of cellular markers comprises the genes or proteins encoded by genes of Table 5.

20. The method of claim 15, wherein the set of cellular markers comprises the genes or proteins encoded by genes of Table 6.

21. The method of claim 15, wherein the set of cellular markers comprises the genes or proteins encoded by genes of Table 7.

22. The method of claim 15, wherein the sample of steps a) and b), the sample of step d), and the sample of step e) are liver tissue extracts.

23. The method of claim 15, wherein the array of step a) is a genomic array.

24. The method of claim 15, wherein the array of step a) is a proteomic array.

25. The method of claim 15, wherein the patient of step a) suffers from a disease selected from the group consisting of hepatitis B, hepatitis C, hemochromatosis, and Wilson's disease.

26. The method of claim 15, wherein the patient of step d) suffers from alcoholic liver disease, autoimmune hepatitis, or primary biliary cirrhosis.

27. The method of claim 15, wherein the patient of step e) suffers from a disease selected from the group consisting of hepatitis B, hepatitis C, hemochromatosis, Wilson's disease, alcoholic liver disease, autoimmune hepatitis, and primary biliary cirrhosis.

28. A computer readable medium comprising:
a) code for a first data set, derived from a first signal from an array comprising capture reagents for a set of cellular markers after contact with a sample from a metastatic HCC patient, the set of cellular markers comprising at least 10 genes or proteins encoded by genes independently selected from the genes of Table 2;
b) code for a second data set, derived from a second signal from an array identical to the array of a) after contact with a sample from a non-metastatic HCC patient;
c) code for a third data set, derived from a third signal from an array identical to the array of a) after contact with a sample from an HCC patient with unknown metastatic potential; and
d) code for comparing the third data set with the first and second data sets.

29. A digital computer comprising the computer readable medium of claim 28.

30. A system comprising:
a) a digital computer of claim 29;
b) a chip with an array comprising capture reagents for a set of cellular markers comprising at least 10 genes or proteins encoded by genes independently selected from the genes of Table 2; and
c) a reader capable of registering a signal from the array after contact with a sample.

31. A computer readable medium comprising:
a) code for a first data set, derived from a first signal from an array comprising capture reagents for a set of cellular markers after contact with a sample from a patient with a chronic liver disease and a high risk of HCC, the set of cellular markers comprising at least 10 genes or proteins encoded by genes independently selected from the genes of Table 5;
b) code for a second data set, derived from a second signal from an array identical to the array of a) after contact with a sample from a patient with a chronic liver disease and a low risk of HCC;
c) code for a third data set, derived from a third signal from an array identical to the array of a) after contact with a sample from a patient with a chronic liver disease and an unknown risk of HCC; and
d) code for comparing the third data set with the first and second data sets.

32. A digital computer comprising the computer readable medium of claim 31.

33. A system comprising:
a) a digital computer of claim 32;
b) a chip with an array comprising capture reagents for a set of cellular markers comprising at least 10 genes or proteins encoded by genes independently selected from the genes of Table 5; and
c) a reader capable of registering a signal from the array after contact with a sample.

34. A method for inhibiting hepatocellular carcinoma (HCC) metastasis in a patient suffering from HCC, the method comprising the step of suppressing osteopontin (OPN) activity.

35. The method of claim 34, wherein the step of suppressing osteopontin (OPN) activity is accomplished by inhibiting OPN expression.

36. The method of claim 35, wherein an antisense polynucleotide is used to inhibit OPN expression.

37. The method of claim 34, wherein the step of suppressing osteopontin (OPN) activity is accomplished by inhibiting the specific binding between OPN and OPN receptor.

38. The method of claim 37, wherein an OPN antagonist is used to inhibit the specific binding between OPN and OPN receptor.

39. The method of claim 37, wherein an anti-OPN antibody is used to inhibit the specific binding between OPN and OPN receptor.

40. A method for inhibiting the development of hepatocellular carcinoma (HCC) in a patient suffering from a chronic liver disease, comprising the step of suppressing EpCAM activity.

41. The method of claim 40, wherein the step of suppressing EpCAM activity is accomplished by inhibiting EpCAM expression.

42. The method of claim 41, wherein an antisense polynucleotide is used to inhibit EpCAM expression.

43. The method of claim 41, wherein a small inhibitory RNA is used to inhibit EpCAM expression.

44. The method of claim 40, wherein the step of suppressing EpCAM activity is accomplished by inhibiting the specific binding between EpCAM and EpCAM receptor.

45. The method of claim 44, wherein an anti-EpCAM antibody is used to inhibit the specific binding between EpCAM and EpCAM receptor.
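
The comparison recited in claims 3, 15, 28, and 31 (a third data set compared with a first and a second data set) can be illustrated as follows. This is a sketch under stated assumptions: the claims do not specify a comparison rule, so the example assumes each data set is a list of expression values over the same ordered marker set and assigns the unknown sample to whichever reference profile it correlates with more strongly (Pearson correlation). The function names, class labels, and numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def classify(third, first, second, labels=("metastatic", "non-metastatic")):
    """Label the third profile by the reference profile it correlates with more."""
    r_first = pearson(third, first)
    r_second = pearson(third, second)
    return labels[0] if r_first > r_second else labels[1]


# Hypothetical profiles over the same four markers.
first = [2.0, 1.5, -1.0, -0.5]    # assumed reference: metastatic sample
second = [-1.0, -0.5, 1.5, 1.0]   # assumed reference: non-metastatic sample
unknown = [1.8, 1.0, -0.7, -0.2]  # sample with unknown metastatic potential

print(classify(unknown, first, second))
```

For the chronic liver disease setting of claims 15 and 31, the same function could be called with labels such as ("high risk", "low risk"); other comparison rules are equally compatible with the claim language.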
Figure 2

Figure 3